ISCTE-IUL

Considering soccer matches as complex systems facilitates the identification of properties that emerge from the interactions between players. Such properties include the regularities and statistical patterns that characterize couplings and sets between players established during matches. Empirical studies on the statistical distributions of the number of items (e.g., words in texts) have shown that these distributions follow scaling properties according to empirical laws known as Zipf-Mandelbrot. Here we investigate whether the (re)occurrence of pitch location of sets of players in a soccer match also obeys these empirical laws. Data collected from 10 soccer matches show that, for most sets of players, this seems to be the case. Exceptions were found in particular types of sets, such as goalkeeper and goal, and left defender and right attacker from opposite teams. Rather than challenging the hypothesis that a Zipf-Mandelbrot law defines this system, these exceptions may be explained by the players' configuration design, which is a trait of human interaction within complex systems. This design expresses match strategy, before the team enters such dynamical processes (the game).


Introduction
Several approaches have been used to study complex systems, including the identification of well-known complexity features Bar-Yam (2002); Kobayashi, Kuninaka, Wakita and Matsushita (2011); Juarrero (2000); Silva, Vilar, Davids, Araújo and Garganta (2016b). For instance, in social complex systems specific behaviors emerge from self-organization Schmidt and Richardson (2008); Passos, Araújo and Davids (2013); Passos, Davids, Araújo, Paz, Minguéns and Mendes (2011); Araújo and Davids (2016). Self-organization results in most cases from the interaction between multiple parts in a system. An interesting and extensively investigated aspect of self-organization is the emerging exchange of information (e.g., verbal and non-verbal communication and their statistical properties) between people interacting in a given system Riley, Richardson, Shockley and Ramenzoni (2011). Typically these communication processes are based on cooperative interactions such as synergistic relations, but also on confrontation, which is a non-cooperative type of interaction.
Here we investigated how these processes and interactions are expressed in team sports. In particular, we asked whether soccer matches have hallmark features of other complex systems. Moreover, we assessed the influence of predefined design on the cooperative and competitive interactions between players. Notably, explicit inter-dependency between player-opponent behaviors is typically observed. The communication processes between players in soccer matches are often visually based and reflected in the players' moves and interpersonal spatial relationships Schmidt and Richardson (2008); Riley et al. (2011); Schmidt and Fitzpatrick (2016); Ramos, Lopes, Marques and Araújo (2017b); Silva, Chung, Carvalho, Cardoso, Davids, Araújo and Garganta (2016a). A key feature of this self-organized behavior is how players become synchronized by maintaining a perceptual link granted by their spatial proximity. The proximity-based sets thus formed may have different dimensions, both in terms of the number of players in the set and how they are organized (i.e., the team each player belongs to). According to hypernetwork theory, each player set with its corresponding inter-relationships forms a simplex (plural: simplices) of players, representing the n-ary spatial interactions between at least two spatially connected players Johnson (2005b); Ramos, Lopes, Marques and Araújo (2017c). Figure 1 shows the simplices found at a particular time frame, t = 00:10, in a soccer match.
In the present study, each simplex is represented by a spatial convex hull enclosing the players in the set. For instance, simplex 35 is composed of players 3, 4, and 11 from team A (blue) and players 18 and 20 from team B (red), thus forming a 3 vs. 2 simplex Johnson (2013); Ramos et al. (2017c). By using the temporal aggregate of the geometrical center of each simplex convex hull, a spatial histogram for that simplex can be obtained. Figures 2a and 2b illustrate two of these histograms using spatial heat maps, for simplices 1 and 35, respectively. Simplices 1 and 5 represent the particular type of relationship that exists between a player (the goalkeeper) and the goal.
We asked whether soccer match histograms exhibit the scaling properties of other human and natural phenomena typically described by power-law models. These power laws are common signatures of chaotic processes that, at some point, become self-organized, as happens in many natural and social systems. Typical examples can be found in population distributions of big cities Izsák (2006); Li and Wang (2019); Malacarne and Mendes (2000), forest fires Malacarne and Mendes (2000), forest patch sizes Saravia, Doyle and Bond-Lamberty (2018), scientific citations Malacarne and Mendes (2000); Silagadze (1997); Komulainen (2004), WWW surfing Malacarne and Mendes (2000), ecology Malacarne and Mendes (2000); Ferrer-i Cancho and Elvevåg (2010); Izsák (2006), solar flares Malacarne and Mendes (2000), economic index Malacarne and Mendes (2000), epidemics in isolated populations Malacarne and Mendes (2000), among others. The Zipf empirical law and its generalization by Mandelbrot are examples of such power laws. In verbal communication processes, including natural language and written texts, several studies have shown that the frequency of word occurrence follows the Zipf power law. Indeed, the corpora of texts and languages have few words that are very frequent (e.g., "a", "the", "I", etc.) and many words which seldom occur. In Zipf's empirical law model, items (e.g., words) are ordered by decreasing frequency and assigned a rank, $r$ (rank 1 is for the most frequent word; rank 2 is for the second most frequent word, etc. Ferrer-i Cancho and Elvevåg (2010)). The occurrence frequency, $f(r)$, then decays linearly as the rank increases on a double logarithmic scale, as expressed in equation 1:

$$f(r) \propto \frac{1}{r^{\alpha}} \tag{1}$$

The generalization of this law by Mandelbrot Piantadosi (2014) has a better fit to empirical data:

$$f(r) \propto \frac{1}{(r + \beta)^{\alpha}} \tag{2}$$

Being a generalization, the latter is also referred to as the Zipf-Mandelbrot (ZM) law.
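The difference between the two laws on a double logarithmic scale can be illustrated numerically. The sketch below is only an illustration: the function names and the parameter values (exponent `alpha`, shift `beta`) are chosen here and do not come from the study. It shows that a pure Zipf law has a constant log-log slope, whereas a Zipf-Mandelbrot law with a positive shift flattens at the lowest ranks and approaches the Zipf slope in the tail.

```python
import math

def zipf_mandelbrot(rank, alpha, beta=0.0):
    """Unnormalized ZM frequency 1 / (rank + beta)**alpha; beta = 0 gives plain Zipf."""
    return 1.0 / (rank + beta) ** alpha

def loglog_slope(r1, r2, alpha, beta=0.0):
    """Slope of log f versus log r between two ranks."""
    f1 = zipf_mandelbrot(r1, alpha, beta)
    f2 = zipf_mandelbrot(r2, alpha, beta)
    return (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))

# Pure Zipf: the log-log slope equals -alpha between any two ranks.
zipf_slope = loglog_slope(1, 10, alpha=1.0)

# ZM with beta > 0: shallower slope at the head, Zipf-like slope in the tail.
zm_head_slope = loglog_slope(1, 10, alpha=1.0, beta=5.0)
zm_tail_slope = loglog_slope(1000, 10000, alpha=1.0, beta=5.0)
```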
Soccer performance features, such as goal scoring distribution, also exhibit these statistical regularities related to power laws. By computing the goal distribution of several major league championships, such as those of Brazil, England, Italy and Spain, it was shown that while there are very few top-scorers, many players score only a few goals Malacarne and Mendes (2000). Rather than focusing on these performance metrics or on individual player behavior, we assessed interpersonal relationships as expressed by simplices' sets, which correspond to the meso-scale system properties. Questions addressed at this level typically concern processes Johnson (2005b); Ramos, Lopes and Araújo (2017a) and therefore aim to explain the mechanisms underlying particular simplices' set occurrence distributions Ramos et al. (2017b).
Several studies have asked whether Zipf's empirical law can be observed in purely random systems Ferrer-i Cancho and Elvevåg (2010); Wentian (1992) by investigating the processes that may drive these particular statistical distributions. We addressed a similar question from a different angle: by analyzing the co-design expressed in the match strategy. Despite the uncertainty of human collective behavior, and therefore the impossibility of predicting the future state of complex systems, the deliberate design of the social structures forming a system may promote specific desired behaviors Johnson (2013, 2010); Blecic and Cecchini (2008); Johnson (2008). This design is in most cases a collaborative or cooperative process Johnson (2005a).
In soccer matches, one can consider that the artificiality (i.e., the design expressed via strategic behaviors) of these social complex systems is related to specific outcomes Johnson (2013, 2008, 2010). A very relevant aspect of the coaching process in team sports is the implementation of the design Rothwell, Davids and Stone (2018). A particular challenge in studying soccer matches as social systems is that the design results from both cooperative and competitive interactions Blecic and Cecchini (2008). The most frequent simplices, which are those that seem to persist over the entire match, must therefore be a consequence of this design (i.e., the strategy of the team).
The theoretical prediction that a macro, strategic, level of organization influences the micro, local, behavior and vice versa was already highlighted in sport sciences Araújo and Davids (2016); Ribeiro, Davids, Duarte Araújo, Silva and Garganta (2019), but without a clear empirical demonstration of its effect.
When a team distributes its players on the pitch (i.e., the team strategy; considering attackers, defenders, midfielders and goalkeepers, placed on the left, right or center of the pitch), they naturally stand near their symmetric opponents, for instance, the right attacker from team A vs. the left defender from team B. As these positions must be maintained during most of the match, sets of opposing players from team A vs. team B (e.g., 1 vs. 1) occur frequently throughout the match. These sets may also depend on the pitch area, such as: i) the simplex set <Goalkeeper, Goal> near the goal, corresponding to players with a very specific and narrow purpose Blecic and Cecchini (2008); ii) the defending team trying to have numerical superiority closer to its box (e.g., simplex 32 in Figure 1) Blecic and Cecchini (2008).
The main questions addressed here refer to the simplices' set occurrence distribution. Specifically, we ask whether the Zipf-Mandelbrot law (ZM) fits the empirical distributions from soccer matches and if design (i.e. match strategy) has an impact on the simplices' set statistical distributions.

Raw data: players' coordinates
We analyzed two-dimensional displacement coordinates from 22 players provided by PROZONE (now STATS; STATS (2020a)) for 10 matches (five at home and five away) of a focus team (team A) during the 2010/2011 English Premier League Season. These data were obtained via a PROZONE semi-automatic tracking system based on multiple-camera analysis. The position of the 22 players during the match was estimated from the synchronized video files of eight cameras placed at the top of the stadium, operating at a frequency of 10 Hz (i.e., 10 frames per second, producing about 54000 frames per match) STATS (2020b). Player substitutions and sendings-off were also considered using ancillary descriptions of the match, e.g., commentary metadata.

Building of simplices' sets and heat maps
For each frame, the 22 players on the pitch and the two goals were organized in sets (simplices sets), according to the computational procedure adopted by Ramos and colleagues Ramos et al. (2017c). The criteria for selecting the players (or goals) for each set were based only on spatial proximity. The two goals were also considered in the simplices formation, as they act as special spatial references to the players, namely the goalkeepers. Figure 1 illustrates the player and goal positions at a particular time frame. Players and goals in the same simplex set are connected within their convex hull. Each simplex $\sigma_i$ is uniquely defined by its index, $i$, and by its element set, $S_i$, such that: $S_i = S_j \implies i = j$. For each frame, $t$, $\Sigma_t$ is the set of all simplices' sets that are found in that frame.
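The grouping of players and goals by spatial proximity can be sketched as follows. This is not the procedure of Ramos et al. (2017c), whose details are not given here; it is a minimal illustration that assumes each entity is linked to its nearest neighbour and that a simplex set is a connected group of such links. The function name `build_simplices`, the labels and the coordinates are all hypothetical.

```python
import math

def build_simplices(positions):
    """Group entities by linking each one to its nearest neighbour (assumed rule).

    positions: dict mapping an entity label to (x, y) pitch coordinates.
    Returns a set of frozensets, one per connected group.
    """
    labels = list(positions)
    parent = {label: label for label in labels}

    def find(label):
        # Follow parent links up to the representative of the group.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    for a in labels:
        # Nearest other entity by Euclidean distance.
        nearest = min((b for b in labels if b != a),
                      key=lambda b: math.dist(positions[a], positions[b]))
        parent[find(a)] = find(nearest)

    groups = {}
    for label in labels:
        groups.setdefault(find(label), set()).add(label)
    return {frozenset(members) for members in groups.values()}

# Hypothetical frame: a goalkeeper near his goal, a 1 vs. 1 and a 1 vs. 2.
frame = {
    "goal_A": (0.0, 34.0), "A1": (3.0, 34.0),
    "A7": (60.0, 10.0), "B3": (62.0, 11.0),
    "A9": (80.0, 30.0), "B4": (82.0, 31.0), "B5": (84.5, 31.0),
}
simplices = build_simplices(frame)
```

Under this assumed rule every group contains at least two elements, which matches the description of simplices as interactions between at least two spatially connected entities.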

ZM model, ranking and bootstrapping the simplices sets
Zipf and Mandelbrot empirical laws relate token values and their rank using equations 1 and 2, respectively. These laws can also be expressed by a probability density function (equation 3):

$$p(r \mid \theta) = \frac{C}{(r + \beta)^{\alpha}} \tag{3}$$

where $p(r \mid \theta)$ is the probability density value for the token with rank $r$ under parameter set $\theta = \{\alpha, \beta\}$. The value of $C$ is given by equation 4:

$$C = \left[ \sum_{r=r_0}^{N} \frac{1}{(r + \beta)^{\alpha}} \right]^{-1} \tag{4}$$

The upper limit of the summation in equation 4 is $N$, which is the number of different simplices observed in the entire match. On the other hand, given that in this study we also investigate the impact of design on the most frequent simplices sets, we also use left truncation in the generalization of these probability density functions. Correspondingly, the summation lower limit, $r_0$, defines the rank used to left truncate the distribution. This represents a generalization of the ZM model in which the most frequent simplices are not considered. For example, if $r_0 = 1$, all simplices sets are considered, and if $r_0 = 3$ the two most frequent simplices sets are not considered. This generalization allows extending the model to the description of systems where the most frequent tokens do not follow the same mechanisms or are under different constraints from other tokens. In the particular case of a soccer match, this can correspond, for example, to particular functions or rules.
Using a counting process, computed over the entire match, we obtained the frequencies for each simplex set, $n_i$. These frequencies were used to rank the simplices. This scoring process is defined in equations 5 and 6:

$$n_i = \sum_{t=1}^{T} \mathbb{1}_t(\sigma_i) \tag{5}$$

where $T$ is the number of samples in the match and $\mathbb{1}_t(\sigma_i)$ is an indicator function given by:

$$\mathbb{1}_t(\sigma_i) = \begin{cases} 1 & \text{if } \sigma_i \in \Sigma_t \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

The total number of simplices' counts, $n$, is given by:

$$n = \sum_{r=r_0}^{N} n_r \tag{7}$$

where $n_r$, the count of the simplex with rank $r$, is obtained from equation 5 and the summation upper and lower limits are the same as in equation 4. To avoid the possible artifacts that result from using the same data set for both ranking and frequency value, as described by Piantadosi (2014), we used a bootstrapping procedure similar to the one proposed in that work. This process is also used for defining confidence intervals for the frequency values Babu and Bose (1988) and for assessing the ZM law fit to the empirical data.
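The counting, ranking and resampling steps can be sketched as below. The exact resampling scheme used in the study is not specified here, so the sketch assumes repeated random split-half resampling in the spirit of Piantadosi (2014): ranks are taken from one half of the frames and frequencies from the other half, which keeps the two estimates independent. All function names, the toy data and the number of repetitions are illustrative assumptions.

```python
import random
from collections import Counter

def count_simplices(frames):
    """Number of frames in which each simplex occurs (frames: list of sets of simplex ids)."""
    counts = Counter()
    for simplices_in_frame in frames:
        counts.update(simplices_in_frame)
    return counts

def split_half_frequencies(frames, rng):
    """Rank simplices on one random half of the frames, measure frequency on the other."""
    shuffled = frames[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    rank_counts = count_simplices(shuffled[:half])
    freq_counts = count_simplices(shuffled[half:])
    ranked = [s for s, _ in rank_counts.most_common()]
    total = sum(freq_counts[s] for s in ranked) or 1
    return [freq_counts[s] / total for s in ranked]

def percentile_band(frames, n_rep=200, low=0.10, high=0.90, seed=0):
    """Low/high percentile band of relative frequency per rank over repeated splits."""
    rng = random.Random(seed)
    reps = [split_half_frequencies(frames, rng) for _ in range(n_rep)]
    n_ranks = min(len(r) for r in reps)
    band = []
    for k in range(n_ranks):
        values = sorted(r[k] for r in reps)
        band.append((values[int(low * (n_rep - 1))], values[int(high * (n_rep - 1))]))
    return band

# Toy match: "gk" is present in every frame, "a" in every 2nd, "b" in every 10th.
frames = [{"gk"} | ({"a"} if t % 2 == 0 else set()) | ({"b"} if t % 10 == 0 else set())
          for t in range(1000)]
counts = count_simplices(frames)
band = percentile_band(frames, n_rep=50)
```

Each entry of `band` is a (10th percentile, 90th percentile) pair for one rank, the kind of interval that can be drawn as a shaded area around a rank-frequency curve.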

Fitting and validating the ZM distribution model
The analysis of data structures related to the ZM distribution in real-life situations requires an effective fitting procedure and an appropriate test for the goodness of fit Izsák (2006). The unknown parameters $\alpha$ and $\beta$ of equation 3 that best fit the distribution to the empirical data set can be estimated by Maximum Likelihood Estimation (MLE). In this method, the estimation is performed by maximizing the likelihood of the distribution (i.e., the ZM distribution for the occurrence of each simplex) with these parameters given a particular data set (i.e., the number of occurrences of each simplex observed in the data set). How probable the distribution is can be assessed via its likelihood estimator, $L$, the value to be maximized. For the ZM distribution the likelihood estimator is given by equation 8:

$$L(\theta) = \prod_{r=r_0}^{N} p(r \mid \theta)^{n_r} \tag{8}$$

To ease the computational maximization process, the logarithm of $L$ is usually preferred, $\ln L$, given by:

$$\ln L(\theta) = \sum_{r=r_0}^{N} n_r \ln p(r \mid \theta) \tag{9}$$

In both equations, $p(r \mid \theta)$ is the probability density function of equation 3 with $\theta = \{\alpha, \beta\}$, $N$ is the number of different simplices observed, $r_0$ is the starting index to be considered, and $n_r$ is the observed number of occurrences for the simplex with rank $r$.
The estimation of the values for parameters $\alpha$ and $\beta$ that minimize $-\ln L$ is performed using the numerical minimization provided by Octave's package function fminsearch via the Nelder & Mead Simplex algorithm Nelder and Mead (1965). Parameter $C$ is obtained from equation 4.
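The maximum likelihood fit can be illustrated with a short, dependency-free sketch. It does not reproduce the Octave routine: a coarse grid search is used here in place of the Nelder and Mead simplex algorithm, purely to keep the example self-contained. The function names, the grid ranges and the synthetic counts are assumptions made for the illustration; `r0` denotes the left-truncation rank.

```python
import math

def zm_log_likelihood(counts_by_rank, alpha, beta, r0=1):
    """Log-likelihood of counts under a left-truncated ZM distribution.

    counts_by_rank[k] is the count of the simplex with rank k + 1;
    ranks below r0 are ignored (left truncation).
    """
    ranks = range(r0, len(counts_by_rank) + 1)
    weights = [(r + beta) ** -alpha for r in ranks]
    log_norm = math.log(sum(weights))  # log of the normalization sum over ranks r0..N
    return sum(counts_by_rank[r - 1] * (math.log(w) - log_norm)
               for r, w in zip(ranks, weights))

def fit_zm_grid(counts_by_rank, r0=1, alphas=None, betas=None):
    """Coarse grid search for the (alpha, beta) pair with the highest log-likelihood."""
    alphas = alphas or [0.1 * i for i in range(1, 41)]   # 0.1 .. 4.0
    betas = betas or [0.5 * i for i in range(0, 41)]     # 0.0 .. 20.0
    return max(((a, b) for a in alphas for b in betas),
               key=lambda ab: zm_log_likelihood(counts_by_rank, ab[0], ab[1], r0))

# Synthetic counts drawn exactly from a ZM law with known parameters.
true_alpha, true_beta, n_ranks = 1.5, 2.0, 50
weights = [(r + true_beta) ** -true_alpha for r in range(1, n_ranks + 1)]
counts = [round(1_000_000 * w / sum(weights)) for w in weights]
alpha_hat, beta_hat = fit_zm_grid(counts)
```

A grid search only resolves the parameters to the grid spacing; a simplex or gradient-based optimizer would normally be started from the best grid point to refine the estimate.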
To test the validity of the model we used the $\chi^2$ metric for assessing its goodness of fit Baker and Cousins (1984); Bentler and Bonett (1980); Spiess and Neumeyer (2010). Although the p-value obtained from the $\chi^2$ statistic was used to decide whether the hypothesis should be rejected, we nevertheless show the $\chi^2/\nu$ value, as it does not depend on the sample size and the "rule of thumb" $\chi^2/\nu < 1$ can be applied to it.
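A worked example of the ratio may help. The sketch below assumes that the ratio is Pearson's chi-square statistic divided by the degrees of freedom, with the degrees of freedom taken as the number of ranks minus the number of fitted parameters minus one; the source does not spell out this convention, and the observed and expected counts are invented for the illustration.

```python
def reduced_chi_square(observed, expected, n_params=2):
    """Pearson chi-square divided by the degrees of freedom.

    Degrees of freedom are taken as (number of ranks - n_params - 1),
    an assumed convention that the source does not spell out.
    """
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - n_params - 1
    return chi2 / dof

# Invented counts per rank: observed data versus counts predicted by a fitted model.
observed = [48, 30, 22, 18, 12, 10, 9, 6]
expected = [50.0, 29.0, 21.0, 17.0, 14.0, 11.0, 8.0, 5.0]
ratio = reduced_chi_square(observed, expected)
```

With these numbers the ratio is below 1, so under the rule of thumb the model would not be rejected.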

Results and Discussion
The figures below show the simplices formation results from three matches (selected out of 10 soccer matches) opposing team A against team B (Figure 3), team A against team C (Figure 4) and team A against team H (Figure 5). In Figures 3a, 4a, and 5a, we plotted the simplices' set relative frequencies versus the rank from the observed data. The gray area in these figures is obtained via a bootstrapping process where the limits correspond to the 10% and 90% percentiles (as described above). The blue and red lines in sub-figures 3a, 4a and 5a correspond to the values obtained from the ZM model for $r_0 = 1$ and $r_0 = 3$, respectively, with parameters $\alpha$, $\beta$ and $C$ estimated via Maximum Likelihood Estimation (MLE) Izsák (2006).
In Figures 3b, 4b and 5b we plotted the $\chi^2/\nu$ metric for assessing the goodness of fit Baker and Cousins (1984); Bentler and Bonett (1980); Spiess and Neumeyer (2010) of the ZM model. (We opted for plotting $\chi^2/\nu$ instead of $\chi^2$ as it is easier to identify the $\chi^2/\nu < 1$ rule of thumb criterion for not rejecting the hypothesis.) This metric was computed according to expressions 8 and 9 and plotted against the $r_0$ value. In Table 1 we present the parameter values ($\alpha$, $\beta$ and $\chi^2/\nu$) of the Mandelbrot generalization for all the 10 matches considered. These results were obtained under two different conditions: considering all the existing simplices ($r_0 = 1$), and removing the two most frequently occurring simplices ($r_0 = 3$).
The results in Figure 3a suggest that the frequency versus rank follows a power law (we used as hypothesis the ZM model). However, the results shown in Figure 3b, where $\chi^2$ is used to assess the goodness of fit of the ZM model, lead to different conclusions depending on how many simplices sets are considered. When all simplices sets are considered ($r_0 = 1$, depicted in Figure 3a as a blue line) the ZM model hypothesis must be rejected. The high value of $\chi^2$ results mostly from the most frequent simplices sets, which clearly do not follow a power law, as they form groups with very similar and high frequencies. Figure 3b shows that $\chi^2$ decreases with the $r_0$ value, and that for $r_0 \geq 3$ the ZM hypothesis should not be rejected. The red line in Figure 3a represents the ZM model resulting from this threshold ($r_0 = 3$). A notable difference between the two thresholds ($r_0 = 1$ and $r_0 = 3$) is the shift of the $\beta$ parameter from almost Zipf-like ($\beta = 0.16978$) to clearly Mandelbrot ($\beta = 4.6029$). Figures 3c to 3e show the spatial position heat maps for the simplices ranked 3rd to 5th in Match B vs. A. They correspond to simplices sets formed by one player of team A and one player of team B (i.e., 1vs.1) located in particular zones of the pitch, along the side lines.
Figures 4a to 4e show similar results for Match Team C vs. Team A. Figure 4b reveals that the same threshold value, $r_0 = 3$, does not reject the ZM model hypothesis. This value is more significant for this match, as there is no substantial change in the $\chi^2$ value above this threshold. Moreover, Figure 4a shows the same trend, with the red line exhibiting a much better fit to the observed data after the 2nd most frequent simplex set. Again, a notable difference is found in the $\beta$ parameter between the two thresholds. Figures 4c to 4e show the spatial position heat maps for the simplices ranked 3rd to 5th in Match C vs. A. Two of these heat maps (4d and 4e) correspond to 1vs.1 simplices along the side lines. On the other hand, the 3rd most frequent simplex set (4c) corresponds to an unbalanced set (two players from Team C and one player from Team A) and the spatial position of the heat map is more intense in the central zone of the pitch and close to Team C's goal.
Figures 5a to 5e present the results for Match Team A vs. Team H. Figure 5b shows that the threshold value for not rejecting the ZM model hypothesis in this match is $r_0 = 4$, which is also clear in Figure 5a, where the red line reveals a much better fit to the observed data after the 3rd most frequent simplex set. Interestingly, the 3rd most frequent simplex set also stands out from all the others. Moreover, as for the other matches, the $\beta$ parameter shifts from almost Zipf-like ($\beta = -0.12456$) to clearly Mandelbrot ($\beta = 10.466$) when using the ZM model for the two thresholds ($r_0 = 1$ and $r_0 = 3$). Figures 5c to 5e show the spatial position heat maps for the simplices ranked 3rd to 5th in Match A vs. H. Two of these heat maps (5c and 5d) correspond to 1vs.1 simplices along the side lines. The results for the 3rd most frequent simplex set (Figure 5c) further reveal the particularity of this set as observed above. Figure 5e corresponds to an unbalanced set (one player from Team A and two players from Team H) and the spatial position of the heat map is more intense in the central zone of the pitch and close to Team H's goal.
In the $r_0 = 1$ table, the $\beta$ values are close to 0, which approximates a Zipf-like distribution. When the two most frequently occurring simplices are removed, the $\beta$ values are significantly higher and the distribution begins to approximate a Mandelbrot distribution.
In these three matches, removing the two most frequent simplices improves the goodness of fit, as shown in panels a) and b) of all figures. Moreover, the two most frequent simplices stand out from all the other simplices sets, not only because of their high frequency values, but also because, when they are removed from the data set, the $\chi^2$ values of the goodness of fit tests Baker and Cousins (1984); Bentler and Bonett (1980) are significantly reduced. It is also interesting to note that when these simplices are considered, the distribution is approximately Zipf (i.e., $\beta \approx 0$); in the opposite scenario, however, the Mandelbrot generalization better describes the results (i.e., $\beta > 4.5$).
Collectively, our results reveal that, in the 10 soccer matches analyzed, the frequency of the overwhelming majority of the simplices that emerge follows the typical distribution of a complex system. The goodness of fit tests support these findings: the null hypothesis postulating that the simplices frequencies follow a Zipf-Mandelbrot (ZM) like distribution is not rejected Bentler and Bonett (1980).

Conclusions
In the present study we show that most of the simplices observed in the 10 soccer matches follow a statistical distribution of occurrence typical of complex systems. This result is supported by goodness of fit tests on the hypothesis of a Zipf-Mandelbrot (ZM) like distribution Bentler and Bonett (1980), which is a hallmark of complex and self-organized systems Schmidt and Fitzpatrick (2016). Moreover, we found that the two most frequent simplices stand out from other simplices sets in frequency values and in their impact on the ZM distribution parameter $\beta$ (from $\beta \approx 0$, Zipf, to $\beta > 4.5$, Mandelbrot).
The players involved in the two most frequent simplices, the goalkeepers, have a very distinctive purpose (defending the goal) and specific rules, when compared with the other players. First, these simplices sets are of the type <Goalkeeper, Goal> and the design of the competition field is established with specific delimited areas on the pitch, as goalkeepers can touch the ball with their hands in these specific areas. Second, the specific role of these players anchors the goalkeepers to their goals, to prevent the opposing team players from scoring a goal. Moreover, we can observe another typical feature of social complexity, namely, intentionality in the behavior of the actors Johnson (2013, 2010); Blecic and Cecchini (2008); Johnson (2008).
Notably, the results from some of the matches (e.g., team A against teams B and H) reveal simplices that seem to be designed, preplanned or conceived before the match to behave differently from the others, i.e., subsets of players that are more frequently close to each other than the others (Figures 3a and 5a). For instance, in Match B vs. A, which has mainly 1vs.1 simplices with a typical positioning in the field (Figures 3c, 3d and 3e), it is clear that the players remained connected in a very specific area of the field during the entire match, and a similar scenario can be observed in Match A vs. H (for the six most frequent simplices). However, in Match C vs. A no simplices stand out from the rank distribution, with the exception of the <Goalkeeper, Goal> simplices. This is also revealed in Figure 4b, where, even after removing the <Goalkeeper, Goal> simplices, the $\chi^2$ values remain low (less than 0.5) and stable.
In conclusion, we found several 1vs.1 simplices, as well as close combinations such as 1vs.2 or 2vs.1 and also 2vs.2, that might reflect a preformed design and strategy of the teams. In addition, we also found a large number of simplices sets that appear less frequently, revealing that many interactions between players are self-organized. The frequency distribution of simplices sets is well modeled by the ZM model, a hallmark of complex systems, with parameters in the range of other systems (e.g., written text, population size). However, large deviations from this model occur for the most common simplices sets, revealing design, a well-identified means to deal with complexity. This aspect is particularly relevant as it results not only from the traditional cooperative design Johnson (2005a), but in this case from both cooperative and competitive processes.

Data Availability
The raw data used in this study, i.e., the players' coordinates on the pitch, are available from City Football Services, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Secondary data, i.e., simplices' constitution at each time frame, are however available from the authors upon reasonable request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.