Open Access
Natl Sci Open, Volume 2, Number 1, 2023
Article Number: 20220043
Number of pages: 17
Section: Information Sciences
DOI: https://doi.org/10.1360/nso/20220043
Published online: 10 January 2023

© The Author(s) 2023. Published by China Science Publishing & Media Ltd. and EDP Sciences.

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

An urgent challenge for artificial intelligence (AI) applications today is the following dilemma concerning data privacy: on the one hand, a large number of sophisticated algorithms have been proposed to broaden the applicability of AI to various domains such as medicine, manufacturing and more [1-3]; on the other hand, regulations such as the General Data Protection Regulation (GDPR) restrict data sharing, thereby limiting the performance of AI algorithms [4]. As a result, models that are trained and evaluated on a limited amount of data, owing to privacy constraints, can be biased [5]. This has become a well-known bottleneck in AI applications, especially medical AI [6].

Promising privacy-preserving methods such as federated learning can help maintain the performance of AI algorithms while keeping the data stored locally [7]. Inspired by this, there has been a surge of interest in both the theory and applications of federated learning [8]. Federated averaging (FedAvg), the leading algorithm in the field of federated learning, was proposed in 2016 by researchers at Google [9, 10], and its variant, dynamic FedAvg, was proposed in ref. [11]. The studies [9-11] mentioned above were based on the analysis of first-order stationary points; the second-order optimality guarantee of federated learning was later established in ref. [12]. Through collective efforts and comprehensive surveys [13-18], widely used federated learning methods were established, the challenges and related applications were introduced, and a large number of valuable research directions were outlined. A notable example has been demonstrated in a report by Kaissis et al. [19], in which a convolutional neural network was trained on medical images over the public Internet, with encryption, using a secure federated learning framework.

Despite these breakthroughs, classical federated learning algorithms have a major drawback: the need for a central client, which can cause privacy, communication, computation, and resilience issues [20]. Much effort has therefore been invested in reducing the communication and computation complexity as well as increasing the robustness and resilience of centralized federated learning algorithms. To deal with the challenge of system constraints, the authors of ref. [21] applied sparsification techniques to reduce the communication and computation costs in the training process. In addition, the authors of ref. [22] proposed an optimal trade-off control algorithm between local updates and global parameter aggregation in resource-constrained systems. Recently, federated schemes were extended to a time-varying centralized scheme, where a changing leader is selected based on certain rules, which is a firm step toward full decentralization [23].

Current research has mainly discussed decentralized federated learning from three different perspectives. Firstly, the initial work [24] realized fully asynchronous and decentralized machine learning by introducing the gossip protocol into deep learning. Furthermore, Daily et al. [25] proposed a stochastic gradient descent algorithm based on the GossipGraD communication protocol for scaling deep learning algorithms on large-scale systems. In addition, Hu et al. [26] proposed a decentralized federated learning system based on segmented gossip aggregation of models. Secondly, specializing in graph convolutional networks, Scardapane et al. [27] proposed an algorithmic framework for distributed training in the case where data are collected by a group of agents communicating through a sparse network topology. Similarly, Pei et al. [28] proposed a new decentralized federated graph neural network model based on a peer-to-peer network structure and realized decentralized learning with privacy protection based on the Diffie-Hellman key exchange method. For decentralized federated learning in general, similar to refs. [21, 22], refs. [29, 30] proposed a decentralized distributed learning algorithm by introducing the concept of belief into the parameter space of the client model, in which a client updates its belief by aggregating information from one-hop neighbors to realize cooperative learning of the optimal model. A fully decentralized federated learning algorithm, IPLS, was proposed by Pappas et al. [31], which is based on model segmentation and is partially based on the InterPlanetary File System. Based on the popular Time Slotted Channel Hopping radio interface of the IEEE 802.15.4e standard, Savazzi et al. [32] proposed a real-time framework for analyzing decentralized federated learning systems running on industrial wireless networks. Thirdly, in terms of decentralization via blockchain, similar to ref. [23], Wu et al. [33] proposed a decentralized federated learning framework that decentralizes all aspects of the system. Ref. [34] proposed a blockchain-based decentralized privacy-preserving deep learning framework, which uses a local credibility mutual evaluation mechanism to achieve collaborative fairness in deep learning. Yapp et al. [35] proposed a federated edge learning (FEL) framework with efficient blockchain authorization and communication, which enhances the security and scalability of large-scale FEL implementations.

However, these methods suffer from one of the following drawbacks: (1) lack of theoretical guarantees, making them unreliable for practical use; (2) lack of temporal flexibility in the communication topology; (3) potential leakage of private information. More specifically, the comparison between these existing studies and the decentralized framework proposed in this paper is shown in Supplementary Information Table S1. This paper proposes a principled fully decentralized federated learning algorithm (DeceFL), in which each client is guaranteed to achieve the same performance as the centralized algorithm in terms of training/test accuracy when the global objective function is smooth and strongly convex. Empirical experiments have demonstrated that the same claim also holds for nonconvex global objective functions, thus revealing its potential to be applied to a wider class of applications such as those using deep learning. In addition to all features that other state-of-the-art federated learning and swarm learning algorithms [23] possess, DeceFL has additional desirable features beyond classical centralized or decentralized federated learning and swarm learning, namely: (1) full decentralization: at any iteration, there is no central client that receives all other clients' information, therefore removing monopoly and preventing attacks; (2) principled algorithm: it has been proved that a zero performance gap between the decentralized and centralized algorithms can be achieved under certain conditions, therefore guaranteeing global performance; (3) flexible topology design: any connected time-varying communication graph suffices to achieve the same performance, therefore allowing clients to decide when and with whom to communicate, subject to their physical constraints; (4) equal rights for all: every client obtains the trained model, therefore incentivizing clients to participate; (5) privacy preservation: no client communicates its gradient to other clients, therefore preventing data leakage even if there is an adversary in the communication channel [36].

THE PROPOSED FULLY DECENTRALIZED FEDERATED LEARNING FRAMEWORK

Problem formulation

We first formulate the decentralized federated learning problem. Assume that there are $K$ clients with local data in the standard form for machine learning tasks: $D_k$ for $k \in \{1,2,\ldots,K\}$. The conventional training of AI models (denoted by $M$) can be formulated as the following global learning problem (let $D \triangleq \cup_k D_k$ and $\cap_k D_k = \emptyset$): $M_c \in \arg\min_M F(D; M)$. In such a centralized framework, clients have to send their datasets $D_k$ to a central client. In contrast, in a decentralized framework, clients would like to collaboratively train a model $M_d$ using the same objective function $F$, in which the $k$-th client does not send its data $D_k$ to others. We define the performance gap as a nonnegative metric, which quantifies the degenerative performance between a centralized model and a decentralized one: $\Delta \triangleq |F(D; M_c) - F(D; M_d)|$. The goal is to make $\Delta$ as small as possible and, in the ideal case, to make $\Delta = 0$.

The proposed DeceFL algorithm

To solve the optimization problem in a decentralized way, we model the communication network between clients as an undirected connected1) graph $G = (N, E, W)$, where $N := \{1,2,\ldots,K\}$ represents the set of clients, and $E \subseteq N \times N$ represents the set of communication channels, each connecting two distinct clients. For each pair $(i, j)$, the corresponding element of the adjacency matrix $W$, i.e., $W_{ij}$, indicates whether there is a communication channel between the $i$-th client and the $j$-th client. Specifically, when $W_{ij} > 0$, there is information communication between clients $i$ and $j$, while $W_{ij} = 0$ means none. For client $i$, when $W_{ij} > 0$, client $j$ is called a neighbor of client $i$. The set of all such clients $j$ is denoted by $N_i$, i.e., $N_i = \{j \mid W_{ij} > 0, j \in N\}$. Define the local loss function $F_k(w) \triangleq F(D_k; M)$ as the user-specified loss function on the dataset $D_k$ with model parameters $w$ in $M$; then $F(D; M)$ can be rewritten as $F(w) \triangleq F(D; M) = \frac{1}{K}\sum_{k=1}^{K} F_k(w)$. Let client $k$ hold a local copy of the global variable $w$, denoted by $w_k \in \mathbb{R}^n$. Specifically, the update rule of DeceFL is, for each client $k = 1, \ldots, K$,
$$w_k(t+1) = \underbrace{\sum_{j=1}^{K} W_{kj}\, w_j(t)}_{\text{average of neighbors' estimates}} - \underbrace{\eta_t \nabla F_k(w_k(t))}_{\text{gradient descent}}, \tag{1}$$
where $\eta_t > 0$ is the learning rate, and the initial condition $w_k(0) \in \mathbb{R}^n$ can be arbitrarily chosen. Every client shares its model parameters, rather than data, with its neighbors (a subset of all other clients). Specifically, every client runs its local training algorithm, e.g., gradient descent, and only communicates its own estimate of the global parameter with its neighbors. Once a client receives the estimates of its neighboring clients, it averages them, adds its local gradient step, and generates its estimate for the next iteration. The above process is repeated until convergence. As shown in Figure 1, in DeceFL each client completes its update by receiving and transmitting directly with neighboring clients and computing its local gradient, without requiring aggregation and transmission by a third-party central client at any iteration. Thus, it is fully decentralized.
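To make update rule (1) concrete, the following is a minimal NumPy sketch of one DeceFL round, assuming synthetic quadratic local losses and a hand-picked symmetric, doubly stochastic weight matrix for a 4-client ring; the function names and parameter choices are illustrative and are not taken from the released DeceFL implementation.

```python
import numpy as np

def decefl_round(w, W, grads, eta):
    """One DeceFL iteration, eq. (1): each client averages its neighbors'
    estimates (row k of W) and takes a step along its local gradient.

    w:     (K, n) array, w[k] is client k's current estimate
    W:     (K, K) symmetric weight matrix, W[k, j] > 0 only for neighbors
    grads: list of K callables, grads[k](w_k) returns client k's local gradient
    eta:   learning rate for this iteration
    """
    mixed = W @ w                                        # neighbor averaging
    local = np.stack([g(w[k]) for k, g in enumerate(grads)])
    return mixed - eta * local                           # local gradient step

# Illustrative use: quadratic local losses F_k(w) = 0.5 * ||w - d_k||^2,
# whose global minimizer is the average of the private vectors d_k.
K, n = 4, 3
rng = np.random.default_rng(0)
d = rng.normal(size=(K, n))                              # private local "data"
grads = [lambda wk, dk=dk: wk - dk for dk in d]          # gradient of each F_k
W = np.array([[0.50, 0.25, 0.00, 0.25],                  # 4-client ring
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
w = np.zeros((K, n))
for t in range(2000):
    w = decefl_round(w, W, grads, eta=2.0 / (t + 20))    # diminishing step size
print(np.abs(w - d.mean(axis=0)).max())                  # every client ends up near the optimum
```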

We stack $w_k(t)$ and $\nabla F_k(w_k(t))$ in eq. (1) into vectors, i.e., define $w(t) = [w_1(t)^T, \ldots, w_K(t)^T]^T \in \mathbb{R}^{Kn}$ and $\nabla F(w(t)) = [\nabla F_1(w_1(t))^T, \ldots, \nabla F_K(w_K(t))^T]^T \in \mathbb{R}^{Kn}$. Then we can compactly rewrite eq. (1) as
$$w(t+1) = (W \otimes I_n)\, w(t) - \eta_t \nabla F(w(t)), \tag{2}$$
where $W = [W_{ij}] \in \mathbb{R}^{K \times K}$ and $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. Next, we analyze the convergence of DeceFL under the following assumption about the global cost function, which is consistent with those made in the convergence analysis of FedAvg [37].
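As a quick sanity check on the stacked form (2), the short sketch below verifies numerically that applying $W \otimes I_n$ to the stacked vector coincides with mixing the per-client estimates row by row; the weight matrix is the same illustrative 4-client ring as above.

```python
import numpy as np

K, n = 4, 3
rng = np.random.default_rng(1)
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
w_clients = rng.normal(size=(K, n))                # per-client estimates, shape (K, n)
w_stacked = w_clients.reshape(K * n)               # stacked vector in R^{Kn}

mixed_stacked = np.kron(W, np.eye(n)) @ w_stacked  # mixing in the stacked form (2)
mixed_clients = W @ w_clients                      # mixing client by client as in (1)
print(np.allclose(mixed_stacked, mixed_clients.reshape(K * n)))  # True
```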

Assumption 1. For each $k = 1, \ldots, K$, assume that $F_k$ is $L_k$-smooth and $\mu_k$-strongly convex, where $L_k, \mu_k > 0$. That is, $F_k$ is differentiable and its gradient is $L_k$-Lipschitz continuous, i.e., for any $x, y \in \mathbb{R}^n$,
$$\|\nabla F_k(x) - \nabla F_k(y)\| \le L_k \|x - y\|, \tag{3}$$
and
$$F_k(x) \ge F_k(y) + \langle \nabla F_k(y), x - y \rangle + \frac{\mu_k}{2}\|x - y\|^2. \tag{4}$$

When Assumption 1 holds, the global objective function $F(\cdot)$ is $L$-smooth and $\mu$-strongly convex, where $L = \max\{L_1, \ldots, L_K\}$ and $\mu = \min\{\mu_1, \ldots, \mu_K\}$. Clearly, $\mu \le L$.

Assumption 1 is standard and is satisfied by typical loss functions in machine learning, including $\ell_2$-regularized linear and logistic regression. In order to analyze the convergence of the algorithm, we define the average sequence $\bar{w}(t) = \frac{1}{K}(\mathbf{1}_K^T \otimes I_n)\, w(t) = \frac{1}{K}\sum_{k=1}^{K} w_k(t)$, where $\mathbf{1}_K \in \mathbb{R}^K$ is the vector whose elements are all 1. Denote by $\lambda$ the spectral norm of $W - \frac{1}{K}\mathbf{1}\mathbf{1}^T$, where $\lambda \in (0,1)$ by algebraic graph theory (see Supplementary Information).

Theorem 1. Consider algorithm (1), where the learning rate is chosen as $\eta_t = \frac{\delta}{t + \Gamma}$, in which $\delta > \frac{1}{\mu}$ and $\Gamma > \frac{\lambda}{1-\lambda}$ satisfy $\frac{\delta}{\Gamma} \le \frac{1}{L}$. Denote the gap between the local function values at the initial point $w(0)$ and the local optimal values by $\varepsilon_0 \triangleq \sum_{k=1}^{K}\left(F_k(w_k(0)) - F_k(w_k^*)\right) \ge 0$, where $w_k^* = \arg\min_{w_k} F_k(w_k)$. Then the following inequalities can be obtained under Assumption 1:
$$\|w(t) - \mathbf{1} \otimes \bar{w}(t)\| \le \frac{\zeta}{t + \Gamma}, \tag{5}$$
and
$$\|\bar{w}(t) - w^*\| \le \frac{\tilde{\zeta}}{t + \Gamma}, \tag{6}$$
where $\zeta \triangleq \max\left\{\Gamma\|w(0) - \mathbf{1} \otimes \bar{w}(0)\|,\ \frac{\delta\sqrt{2L\varepsilon_0}\,\Gamma}{\Gamma + 1 - \lambda}\right\}$ and $\tilde{\zeta} \triangleq \max\left\{\Gamma\|\bar{w}(0) - w^*\|,\ \frac{1}{\mu\delta - 1}\cdot\frac{L\delta\zeta}{\sqrt{K}}\right\}$.

Proof. See Supplementary Information.

In ref. [38], the authors studied the convergence of FedAvg and established a convergence rate of $O(1/T)$ (where $T$ is the number of iterations of gradient descent) for strongly convex and smooth problems. Here Theorem 1 guarantees that the proposed DeceFL algorithm has the same convergence rate as FedAvg under time-invariant connected communication topologies. In addition, it is further demonstrated in the Supplementary Information that DeceFL converges even for time-varying topologies, in which case the underlying topology does not need to be connected at every iteration (as illustrated in Figure 1). Finally, since FedAvg and SL can be viewed as special cases of DeceFL, the proof of convergence for DeceFL also covers centralized federated learning and SL.

Figure 1

Illustration of key concepts in different state-of-the-art federated learning frameworks. (A) Classical Federated Learning: a central client is needed to receive and transmit all essential information to other clients. It is equivalent to an all-to-all network without such a central client, i.e., every client in the network can receive information from all other clients. (B) Swarm Learning: there is no universal central client, but a potentially different central client is selected in every iteration. Mathematically, it is equivalent to FedAvg with varying central clients. (C) The proposed Decentralized Federated Learning: there is no need for a central client at any iteration. Any connected time-invariant or time-varying topology (under certain mild conditions given in the Supplementary Information) has guaranteed performance, therefore unifying classical federated learning and swarm learning.

An illustrative example

We use a simple yet important example to illustrate the problem and demonstrate the advantages of the proposed decentralized framework as compared with centralized learning, federated learning and swarm learning. Consider the case where $K$ clients would like to compute the average of every client's private value $D_k = w_k(0)$, i.e., $w^* = \frac{1}{K}\sum_{k=1}^{K} w_k(0)$. This task is simple if there exists a central client. However, it becomes challenging when a central client is not available. This is the well-known decentralized consensus problem [39-41], which has many engineering applications including synchronization, PageRank, state estimation, load balancing, and more.

We can convert the consensus problem into the following optimization problem:
$$\min_w F(D; w) \triangleq \frac{1}{2}\sum_{k=1}^{K} \left(w - w_k(0)\right)^2, \tag{7}$$
whose minimizer coincides with $w^*$. Rather than computing the mean directly, we solve the optimization problem (7) using FedAvg, SL and the proposed DeceFL, respectively.
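For completeness, setting the gradient of (7) to zero shows that its minimizer is exactly the average of the private values:

```latex
\nabla_w F(D;w) \;=\; \sum_{k=1}^{K}\bigl(w - w_k(0)\bigr) \;=\; Kw - \sum_{k=1}^{K} w_k(0) \;=\; 0
\quad\Longrightarrow\quad
w \;=\; \frac{1}{K}\sum_{k=1}^{K} w_k(0) \;=\; w^*.
```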

FedAvg algorithm

In the classical federated learning algorithm, i.e., FedAvg, a central server collects the local parameter information of each client for average aggregation and assigns the result back to each client; after a simple derivation, this is equivalent to a complete graph of $K$ clients without a central server (Figure 1A):
$$w_k(t+1) = \frac{1}{K}\sum_{j=1}^{K} w_j(t) - \eta_t\left(w_k(t) - w_k(0)\right), \tag{8}$$
where $t$ denotes the $t$-th iteration, $w_k(t)$ denotes the estimate of the global optimum by the $k$-th client at iteration $t$, and $\eta_t$ is the learning rate of the gradient descent algorithm. In essence, every client iteratively updates its estimate based on all other clients' estimates together with its current gradient. Using the derivation in the Supplementary Information based on dynamical systems and algebraic graph theory, it can be shown that the system reaches the steady state, i.e., $\lim_{t\to\infty} \|w_k(t) - w^*\| = 0$, if $\eta_t = \frac{\gamma}{t+\Gamma}$ and $\frac{\gamma}{\Gamma} < 1$ are satisfied for any $\gamma, \Gamma > 0$.
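The following is a minimal sketch of iteration (8) on the scalar consensus example, with illustrative values of $\gamma$ and $\Gamma$ chosen so that $\gamma/\Gamma < 1$ (they are not values taken from the paper):

```python
import numpy as np

def fedavg_consensus(w0, rounds=500, gamma=0.5, Gamma=1.0):
    """Iteration (8): each client uses the global average of all estimates
    (central aggregation) plus a gradient step on F_k(w) = 0.5 * (w - w_k(0))^2."""
    w = w0.copy()
    for t in range(rounds):
        eta = gamma / (t + Gamma)
        w = w.mean() - eta * (w - w0)   # global average + local gradient step
    return w

w0 = np.array([1.0, 3.0, 5.0, 7.0])     # private values of K = 4 clients
print(fedavg_consensus(w0))             # every entry approaches w* = 4.0
```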

SL algorithm

The SL algorithm considers the situation where there is no central server: in each iteration, a random leader is dynamically selected from the clients to aggregate the model parameters from all clients (including itself) and assign them to each client as shown in Figure 1B. Mathematically, this is exactly the same as centralized federated learning as shown in the Supplementary Information.

DeceFL algorithm

Each client communicates parameters through the topology of the undirected connected graph shown in Figure 1C. Based on the weighted aggregation of information obtained from neighboring clients, the local update is completed according to the following iteration:
$$w_k(t+1) = \sum_{j=1}^{K} W_{kj}(t)\, w_j(t) - \eta_t\left(w_k(t) - w_k(0)\right), \tag{9}$$
where $W(t) \in \mathbb{R}^{K \times K}$ is the weight matrix of the undirected connected graph at iteration $t$. Using the derivation in the Supplementary Information based on dynamical systems and algebraic graph theory, it can be shown that the system reaches the steady state, i.e., $\lim_{t\to\infty} \|w_k(t) - w^*\| = 0$, if $\eta_t = \frac{\gamma}{t+\Gamma}$ and $\frac{\gamma}{\Gamma} < 1 - \sigma$ are satisfied2) for any $\gamma, \Gamma > 0$.
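A matching sketch of iteration (9) with a fixed weight matrix on a 4-client ring is given below; the weights are a simple symmetric, doubly stochastic choice for illustration, whereas the paper allows any connected (and possibly time-varying) topology.

```python
import numpy as np

def decefl_consensus(w0, W, rounds=2000, gamma=0.5, Gamma=2.0):
    """Iteration (9): each client mixes only its neighbors' estimates via W
    and takes a gradient step on its private loss F_k(w) = 0.5 * (w - w_k(0))^2."""
    w = w0.copy()
    for t in range(rounds):
        eta = gamma / (t + Gamma)
        w = W @ w - eta * (w - w0)
    return w

w0 = np.array([1.0, 3.0, 5.0, 7.0])
W = np.array([[0.50, 0.25, 0.00, 0.25],   # ring: each client talks to two neighbors
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
print(decefl_consensus(w0, W))            # every entry approaches w* = 4.0
```

Replacing $W$ with $\frac{1}{K}\mathbf{1}\mathbf{1}^T$ recovers iteration (8), which is how FedAvg and SL arise as special cases.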

Summary

It can be shown that all methods reach consensus with zero performance gap, i.e., $\Delta = 0$. The convergence rate of all methods is $O(1/T)$. As shown in Supplementary Information Figure S1, these methods converge to the consensus value exponentially, which is consistent with the theoretical results. The information used by the three algorithms is, however, distinct: FedAvg and SL need global information at every iteration, while DeceFL only needs local information from clients' neighbors. In addition, the first two algorithms can be viewed as special cases of the proposed DeceFL algorithm by setting $W$ to the corresponding adjacency matrix. Specifically, when $W(t) = \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^T$ for all $t$, the proposed DeceFL algorithm reduces to FedAvg and SL.

RESULTS

Experiments were carried out on real-world biomedical and industrial applications, demonstrating the effectiveness and wide applicability of DeceFL as a fully decentralized framework. We benchmarked the performance of DeceFL against FedAvg and SL, which demand much higher communication costs and rely strongly on a highly restricted communication topology. Furthermore, the robustness of DeceFL in the presence of communication topology interference (random node or edge malfunction) was shown in two experiments with time-varying communication topologies. The overall performance of DeceFL was corroborated by these practical applications.

Application to medicine: collaborative prediction of leukaemia

First, we used the dataset of peripheral blood mononuclear cell (PBMC) transcriptomes from ref. [23], named "dataset A2", as a benchmark example to compare three federated learning frameworks: DeceFL, FedAvg and SL. The experimental setup is consistent with that of SL: the dataset is divided into a training set and a test set at a ratio of 8:2, and the dataset owned by each node was drawn from the training set. A logistic regression model with $\ell_2$ regularization and an 8-layer fully connected deep neural network (as in ref. [23]) were selected for our experiments, detailed in the MATERIALS AND METHODS Section.

First, we benchmarked DeceFL against FedAvg and SL in the IID setup of dataset A2 (Figure 2A); that is, the sample size of each node is the training set size divided by the number of nodes, which ensures that each node has the same number of samples and that the ratio of positive to negative samples is approximately 1:1. DeceFL used multiple connected communication topologies with various connectivity probability values (p = 0.3, 0.5, 0.7, 0.9). This benchmark shows that DeceFL reaches the same performance as FedAvg and SL, which use a (temporary) central client to gather all information from every client. FedAvg and SL only perform better during the transient period that DeceFL takes to converge, owing to its decentralized nature. In contrast, DeceFL significantly relaxes the choice of communication topology: unlike FedAvg and SL, it no longer requires a fixed topology and thus has higher flexibility to deal with various demands from clients (e.g., one client could refuse to communicate directly with certain clients). Second, a similar comparative study was repeated with the Non-IID setup of dataset A2 (Figure 2B). The Non-IID setup explicitly designs, for the local data associated with each node, the sample size and the ratio between positive and negative samples (Supplementary Information). It allows us to benchmark performances on balanced/unbalanced and sufficient/deficient local training data. We obtained very similar results as in the IID setup, where DeceFL presents a performance similar to FedAvg and SL once DeceFL reaches consensus in the decentralized computation. It also shows the superiority of DeceFL over SL, which demands large communication costs for the selected central client at every round and relies heavily on the strong assumption of a stable fully connected communication structure. Any malfunction of clients or communication paths could bring down the whole SL process, since at each round a client is delegated to collect information from all other clients.
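For readers who wish to reproduce topologies like those used here, the sketch below shows one common way to sample a connected Erdos-Renyi communication graph with connectivity probability p and to build a symmetric, doubly stochastic mixing matrix from it (Metropolis weights); this construction is an assumption for illustration, not necessarily the exact weighting used in the paper's experiments.

```python
import numpy as np
import networkx as nx

def connected_er_graph(K, p, seed=0):
    """Sample Erdos-Renyi graphs G(K, p) until a connected one is obtained."""
    rng = np.random.default_rng(seed)
    while True:
        G = nx.erdos_renyi_graph(K, p, seed=int(rng.integers(1 << 31)))
        if nx.is_connected(G):
            return G

def metropolis_weights(G):
    """Symmetric, doubly stochastic mixing matrix built from a graph."""
    K = G.number_of_nodes()
    W = np.zeros((K, K))
    for i, j in G.edges():
        W[i, j] = W[j, i] = 1.0 / (1 + max(G.degree[i], G.degree[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # self-weights make each row sum to 1
    return W

G = connected_er_graph(K=8, p=0.3)
W = metropolis_weights(G)
print(np.allclose(W.sum(axis=0), 1.0), np.allclose(W, W.T))  # doubly stochastic and symmetric
```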

To show that DeceFL functions well when the decentralized infrastructure is perturbed, we conducted two experiments with time-varying graphs that take into account malfunctions of clients and communication paths. First, the communication topology was altered at runtime (Figure 3A); that is, the adjacency matrix that describes how clients communicate with each other varied over time. This time-varying experiment further shows that the conditions of DeceFL can be weakened and generalized to such an extent that the communication graph at each time instant need not be connected, as long as information can be transmitted between any pair of clients within a fixed period. To calibrate the performance of DeceFL, we assumed the ideal communication topology (no structural intervention) in the run of FedAvg (that is, a central node with consistent connections to every client). Remarkably, both experimental results, Figure 3C (IID) and Figure 3D (Non-IID), show that DeceFL in such a scenario maintains performance similar to FedAvg (with an ideal setup). In other words, DeceFL is so robust that random malfunction of a portion of edges barely deteriorates the DeceFL training process.

As another source of structural intervention, the second experiment considers the removal and addition of nodes, which is a significant issue for federated learning in practice and in business. In a long-running and data-intensive machine learning project involving hundreds of clients, the fact that clients may drop out or new clients may join midway must not deteriorate or break such a time-consuming computation process. As shown in Figure 3B, during the first 50 rounds we used an Erdos-Renyi graph of 6 nodes; for rounds 51–100, 2 extra nodes were added to form an 8-node graph; and in the remaining rounds, 2 nodes were randomly removed. Under such node interventions, the experimental results in Figure 3E and 3F show robust performance of DeceFL, similar to FedAvg (which does not consider node interventions). These two experiments demonstrate the robustness of DeceFL against interventions to the computation infrastructure in a decentralized framework.

Figure 2

DeceFL to predict leukaemia from the A2 benchmark dataset [23]. (A) Data were divided into IID samples for all clients. (B) Data were divided into Non-IID unbalanced samples. (C) Different topologies for FedAvg, SL and DeceFL, respectively. The topology for SL must hold in every iteration when any other node is selected as the central client. (D)–(G) Performance of the three algorithms on IID/Non-IID setups with logistic regression/neural networks. DeceFL presents a performance similar to FedAvg and SL in both IID and Non-IID scenarios, except during the transient period that DeceFL takes to reach consensus. In contrast to the fixed topology in FedAvg and SL, DeceFL provides much more freedom in the choice of communication topology.

Figure 3

DeceFL to predict leukaemia from the A2 benchmark dataset [23]. (A) Time-varying communication topology consisting of a sequence of graphs, each of which is not connected, while the lump-sum graph over a fixed period is connected. (B) Time-varying communication topology that adds or removes nodes over time. (C), (D) Performance of DeceFL with edge-varying graphs on the IID and Non-IID setups of dataset A2 using logistic regression, with reference performance of FedAvg using full information. (E), (F) Performance of DeceFL with node-varying graphs on the IID and Non-IID setups of dataset A2 using logistic regression, with reference performance of FedAvg using full information. These time-varying experiments demonstrate the robustness of DeceFL in the presence of interventions to the communication topology. The dropout or addition of clients in the middle of a DeceFL computation will not break or deteriorate the learning process, which is particularly essential in practice for large-scale, time-consuming, data-intensive applications.

Application to smart manufacturing: collaborative detection of bearing faults

Another application of DeceFL lies in the manufacturing sector: modern manufacturing is heavily influenced by AI technologies, with an extraordinary increase in computational power and data size. To raise productivity and reduce operational costs, a critical challenge is fault diagnosis in machining operations [42]. AI-based algorithms have the potential to detect fault locations and even to predict faults in advance, which allows regular maintenance to be replaced with real-time data-driven predictive maintenance, further reducing unnecessary maintenance costs and guaranteeing reliability. A general fault detection framework has been proposed in ref. [43], which, however, needs full-cycle measurements from large numbers of machines that are most likely unavailable to a single factory. Data generated by multiple factories could be sufficient to perform preventive maintenance, while sensitive data (security or business related) are unlikely to be shared in practice. The fully decentralized framework DeceFL provides a way for multiple factories to develop a global model, generating mutual benefit from private local data without resorting to public data sharing.

This experiment implements such a decentralized fault diagnosis application in manufacturing, using Case Western Reserve University's (CWRU) bearing data, which comprise ball bearing test data for normal and faulty bearings, specified in the MATERIALS AND METHODS Section. Specifically, we used three fault diameters, 0.007, 0.014 and 0.021 inches, and chose the drive-end defects, which include outer race, inner race, and ball defects. We chose the outer race defect appearing at the 6 o'clock (centered) position. Thus, there are in total ten distinct conditions: 9 faulty classes (3 fault diameters times 3 defect types) and the normal condition. All data in use were collected at 12,000 samples/second for the drive-end bearing experiments and were generated using 4 motor speeds: 1797 r/min, 1772 r/min, 1750 r/min and 1730 r/min. The data from 1730 r/min were reserved for testing.

Assume that there are 4 factories, as the clients illustrated in Figure 4C, which collect their private full-cycle bearing data. The training data associated with each client were prepared in the IID (Figure 4A) and the Non-IID setup (Figure 4B). A 10-way classification problem is considered: 9 fault cases (B007, IR007, OR007, B014, IR014, OR014, B021, IR021, OR021) and 1 normal case. Two learning methods were used: regularized logistic regression as a strongly convex method, and a deep neural network (DNN) as a nonconvex method. With logistic regression, as guaranteed by theory, DeceFL in Figure 4D and 4F achieves performance similar to FedAvg after its transient period. For the DNN, a non-convex method, although there is no theoretical guarantee, DeceFL in Figure 4E and 4G shows performance competitive with FedAvg. The slight test performance gap between DeceFL and FedAvg in Figure 4F may be mostly caused by the chosen type of DNN, the multilayer perceptron used in ref. [23], which has many well-known defects; further reasons are discussed and explored through more experiments in the Supplementary Information. Comprehensive experiments, with more clients (that is, more factories), more learning methods, and another 4-way classification problem, can also be found in the Supplementary Information. Overall, DeceFL demonstrates competitive performance on multi-class classification for industrial fault diagnosis applications, with implementations of (non-)convex methods in a fully decentralized framework that breaks through the barrier of data privacy.

Figure 4

DeceFL to detect bearing faults from the CWRU benchmark dataset. (A) Data were divided into IID samples for all 4 clients. (B) Data were divided into Non-IID unbalanced samples; each client locally specified its data size and sample distribution. (C) Illustration of communication topologies for FedAvg, SL and DeceFL. (D), (E) Performance of DeceFL on IID data using logistic regression and DNN, respectively, with reference performance of FedAvg and SL. (F), (G) Performance of DeceFL on Non-IID data using logistic regression and DNN, respectively, with reference performance of FedAvg and SL. (D)–(G) The boxplots at the bottom illustrate the performance comparison between DeceFL and each client trained independently (that is, each client trained its own model using only its associated local data without communicating with any other clients). The experiments on the CWRU benchmark dataset confirm the similar performance of DeceFL, as a fully decentralized framework, in comparison to FedAvg and SL for industrial datasets, while DeceFL offers more freedom in the choice of communication topology to handle practical issues in organizing clients for large-scale federated learning applications.

DISCUSSION

In this paper, we propose a decentralized federated learning framework for privacy-preserving collaborative learning. The decentralized architecture eliminates many of the drawbacks, caused by reliance on a central client, that state-of-the-art privacy-preserving algorithms have. The convergence of the DeceFL algorithm is analyzed in detail, showing that DeceFL guarantees convergence and has a convergence rate similar to that of the centralized federated learning algorithm. The convergence performance of the algorithm is verified by training neural networks on different datasets. Compared with other state-of-the-art privacy-preserving algorithms such as FedAvg and SL [23], the proposed DeceFL algorithm is guaranteed to reach the global optimum at a rate similar to the centralized federated learning algorithm under certain conditions. In addition, a sizable literature has developed, as surveyed in ref. [44], whose techniques can be adapted to cope with the quantization errors and noise that may occur over communication networks.

There is no doubt that such a decentralized federated learning framework will become increasingly popular in the near future for almost all AI applications, given the privacy regulations. Yet our algorithm has a number of limitations that need to be taken into consideration in future development.

First of all, application of privacy algorithms (for example, blockchain or homomorphic encryption [45]) has not been considered in this study. However, similar to the centralized federated learning and swarm learning framework, it should be straightforward to apply such techniques for data privacy protection in the proposed DeceFL framework to make communication secure.

Secondly, similar to the setup of federated learning and swarm learning, all clients in the network need to know the form of the global objective function. The proposed formulation is different from those used in multi-party computation [46], where model and data can be separated [47]. Future work lies in the integration of such techniques to the proposed DeceFL algorithm to make the global objective function unknown to clients.

Finally, all clients are assumed to be collaborative; it would be interesting to further investigate whether the proposed decentralized federated learning framework is vulnerable to semi-honest or malicious clients that are not collaborative [48], which could exist in real-world applications.

MATERIALS AND METHODS

Data pre-processing

Essential data preprocessing was performed on the CWRU dataset, including class balancing, normalization, and feature extraction by Fourier transform. The dataset has 10 classes in total, which vary in sample size; hence samples in certain classes were deleted to balance sample sizes over all classes. The original data are time series, which were first segmented into windows of 300 points, resulting in a family of time series. Each window takes the DE and FE channels, respectively, giving 600 points. For every 300-point series of each channel, we performed a fast Fourier transform (FFT), which yielded 150 points; thus each sample, combining DE and FE, has 300 points in total. The motivation for using the FFT is to handle the mismatch of time stamps in sequential data. After the FFT, the training and test data were normalized by removing the mean and scaling to unit variance (the test data were normalized by the normalizer fitted on the training data).
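The sketch below illustrates the described windowing, FFT and normalization steps; the array names, the use of FFT magnitudes, and the random placeholder signals are assumptions for illustration, not the exact released preprocessing code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_features(de_signal, fe_signal, window=300):
    """Cut the DE and FE channels into 300-point windows and keep the first half
    of each window's FFT (150 points per channel), i.e., 300 features per sample."""
    n_windows = min(len(de_signal), len(fe_signal)) // window
    feats = []
    for i in range(n_windows):
        seg_de = de_signal[i * window:(i + 1) * window]
        seg_fe = fe_signal[i * window:(i + 1) * window]
        fft_de = np.abs(np.fft.fft(seg_de))[:window // 2]   # 150 points
        fft_fe = np.abs(np.fft.fft(seg_fe))[:window // 2]   # 150 points
        feats.append(np.concatenate([fft_de, fft_fe]))      # 300 points in total
    return np.asarray(feats)

# Placeholder signals stand in for the DE/FE accelerometer recordings.
X_train = make_features(np.random.randn(120000), np.random.randn(120000))
X_test = make_features(np.random.randn(30000), np.random.randn(30000))

# Normalization: fit on the training split only, then apply to the test split.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```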

Performance metrics

The common performance metric "accuracy" is used to assess classification:
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{10}$$
where TP, TN, FP and FN denote the number of true positive, true negative, false positive and false negative samples, respectively.
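A one-line helper computing eq. (10), with made-up counts for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (10): fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=45, tn=40, fp=5, fn=10))  # 0.85
```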

Implementation of learning methods

To ensure that DeceFL works well in the benchmarks for both strongly convex and non-convex learning methods, we adopted two algorithms: logistic regression with $\ell_2$ regularization, and a deep neural network (DNN), as used in ref. [23]. For logistic regression, every node runs 1 epoch in each communication round, with batch size 64. It uses the SGD optimizer with weight decay coefficient $10^{-4}$ to realize the $\ell_2$ regularization. The learning rate is fixed to 0.001. For the DNN, every node runs 5 epochs (dataset A2) or 10 epochs (CWRU dataset) in each communication round, with batch size 64. It uses the SGD optimizer with weight decay coefficient $10^{-4}$. The initial learning rate is 0.1, multiplied by 0.1 every 20 communication rounds. The DNN has 8 hidden layers, whose dimensions are 256, 512, 512, 256, 256, 128, 128 and 64, respectively. The dropout rate is set to 0.3. Most experiments use softmax as the activation function in the output layer for classification, except that the logistic regression for dataset A2 uses a sigmoid. The total number of communication rounds was selected based on visual inspection of convergence for all compared methods.
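A PyTorch sketch consistent with the description above (layer widths, dropout 0.3, SGD with weight decay $10^{-4}$, and the step learning-rate decay); the hidden-layer activation (ReLU) and the input/output dimensions are assumptions, and softmax is applied implicitly through the cross-entropy loss rather than as an explicit output layer.

```python
import torch
import torch.nn as nn

def build_dnn(in_dim, n_classes, hidden=(256, 512, 512, 256, 256, 128, 128, 64), p_drop=0.3):
    """8-hidden-layer fully connected network as described in the text."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(p_drop)]
        prev = h
    layers.append(nn.Linear(prev, n_classes))   # logits; softmax is applied in the loss
    return nn.Sequential(*layers)

model = build_dnn(in_dim=300, n_classes=10)     # e.g., CWRU: 300 FFT features, 10 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # x0.1 every 20 rounds
criterion = nn.CrossEntropyLoss()               # combines log-softmax and negative log-likelihood
```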

Data availability

The peripheral blood mononuclear cell (PBMC)-derived transcriptome dataset, named "dataset A2" in ref. [23], was used. It was originally generated with Affymetrix HG-U133 2.0 microarrays (8348 individuals) by inspection of all publicly available datasets at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus. To perform the IID experiments, the initial preparation of the dataset randomly dropped negative samples, resulting in 5176 samples, such that the whole dataset is balanced, i.e., the ratio of positive to negative samples is 1:1. The CWRU bearing dataset contains ball bearing test data for normal and faulty bearings from Case Western Reserve University, available at https://engineering.case.edu/bearingdatacenter. All source codes are openly available on GitHub (https://github.com/HAIRLAB/DeceFL).

Acknowledgments

We thank Mr. Anthony Haynes and Mr. Cai Huang for editing.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 92167201, 52188102, 62133003, 61991403, 61991404, and 61991400) and the Jiangsu Industrial Technology Research Institute (JITRI).

Author contributions

Idea was conceived by Y.Y. Theory was developed by J.L., R.C., L.X., X.Y., T.Y., Y.Y. Simulation codes were developed by D.J., C.S., M.W. and were reviewed by Z.Y. Experiments were designed by Y.Y., Z.Y., and were performed by D.J., Z.Y., C.S., M.W., Y.Y., F.H., R.C. Projects were supervised by Y.Y., S.S., H.D. Funding was acquired by Y.Y., J.L., H.Z., H.D. The original draft was written by Y.Y., Z.Y., J.L., R.C., X.Y. and all authors provided critical review of the manuscript and approved the final draft.

Conflict of interest

The authors declare no competing interests.

Supplementary information

The supporting information is available online at https://doi.org/10.1360/nso/20220043. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.


1)

In this work, we consider that the information communication between clients is mutual for notational simplicity; therefore, the adjacency matrix $W = [W_{kj}] \in \mathbb{R}^{K \times K}$ is symmetric. Furthermore, we assume that the underlying topology is connected, i.e., for any two clients $k$ and $j$, there is at least one path from $k$ to $j$.

2)

Let us sort the eigenvalues of $W$ in non-increasing order as $1 = \lambda_1(W) > \lambda_2(W) \ge \cdots \ge \lambda_n(W) > -1$, and denote by $\sigma$ the second largest eigenvalue magnitude of the weighting matrix $W$, i.e., $\sigma = \max\{|\lambda_2(W)|, |\lambda_n(W)|\} \in (0,1)$.

References

  • LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521: 436–444.
  • Yuan Y, Tang X, Zhou W, et al. Data driven discovery of cyber physical systems. Nat Commun 2019; 10: 4894.
  • Yan L, Zhang HT, Goncalves J, et al. An interpretable mortality prediction model for COVID-19 patients. Nat Mach Intell 2020; 2: 283–288.
  • Price WN II, Cohen IG. Privacy in the age of medical big data. Nat Med 2019; 25: 37–43.
  • DeGrave AJ, Janizek JD, Lee SI. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 2021; 3: 610–619.
  • Roberts M, Driggs D, Thorpe M, et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 2021; 3: 199–217.
  • Konečný J, McMahan B, Ramage D. Federated optimization: Distributed optimization beyond the datacenter. ArXiv: 1511.03575.
  • Yang Q, Liu Y, Chen T, et al. Federated machine learning. ACM Trans Intell Syst Technol 2019; 10: 1–19.
  • Konečný J, McMahan HB, Yu FX, et al. Federated learning: Strategies for improving communication efficiency. ArXiv: 1610.05492.
  • McMahan B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, 2017, 54: 1273–1282.
  • Rizk E, Vlaski S, Sayed AH. Dynamic federated learning. ArXiv: 2002.08782.
  • Vlaski S, Rizk E, Sayed AH. Second-order guarantees in federated learning. In: Proceedings of the 2020 54th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, 2020, 915–922.
  • Bonawitz K, Eichner H, Grieskamp W, et al. Towards federated learning at scale: System design. ArXiv: 1902.01046.
  • Gu R, Yang S, Wu F. Distributed machine learning on mobile devices: A survey. ArXiv: 1909.08329.
  • Kairouz P, McMahan HB, Avent B, et al. Advances and open problems in federated learning. ArXiv: 1912.04977.
  • Li T, Sahu AK, Talwalkar A, et al. Federated learning: Challenges, methods, and future directions. IEEE Signal Process Mag 2020; 37: 50–60.
  • Liu J, Huang J, Zhou Y, et al. From distributed machine learning to federated learning: A survey. Knowl Inf Syst 2022; 64: 885–917.
  • Bellet A, Guerraoui R, Taziki M, et al. Personalized and private peer-to-peer machine learning. ArXiv: 1705.08435.
  • Kaissis G, Ziller A, Passerat-Palmbach J, et al. End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nat Mach Intell 2021; 3: 473–484.
  • Lian X, Zhang C, Zhang H, et al. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, 2017, 30: 5330–5340.
  • Tang Z, Shi S, Chu X. Communication-efficient decentralized learning with sparsification and adaptive peer selection. ArXiv: 2002.09692.
  • Wang S, Tuor T, Salonidis T, et al. Adaptive federated learning in resource constrained edge computing systems. IEEE J Sel Areas Commun 2019; 37: 1205–1221.
  • Warnat-Herresthal S, Schultze H, Shastry KL, et al. Swarm Learning for decentralized and confidential clinical machine learning. Nature 2021; 594: 265–270.
  • Blot M, Picard D, Cord M, et al. Gossip training for deep learning. ArXiv: 1611.09726.
  • Daily JA, Vishnu A, Siegel CM, et al. GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent. ArXiv: 1803.05880.
  • Hu C, Jiang J, Wang Z. Decentralized federated learning: A segmented gossip approach. ArXiv: 1908.07782.
  • Scardapane S, Spinelli I, Lorenzo PD. Distributed training of graph convolutional networks. IEEE Trans Signal Inf Process over Networks 2021; 7: 87–100.
  • Pei Y, Mao R, Liu Y, et al. Decentralized federated graph neural networks. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, 2021.
  • Lalitha A. Fully decentralized federated learning. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Montreal, 2018.
  • Lalitha A, Kilinc OC, Javidi T, et al. Peer-to-peer federated learning on graphs. ArXiv: 1901.11173.
  • Pappas C, Chatzopoulos D, Lalis S, et al. IPLS: A framework for decentralized federated learning. In: Proceedings of the 2021 IFIP Networking Conference, Espoo, 2021.
  • Savazzi S, Kianoush S, Rampa V, et al. A joint decentralized federated learning and communications framework for industrial networks. In: Proceedings of the 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), 2020, 1–7.
  • Wu X, Wang Z, Zhao J, et al. FedBC: Blockchain-based decentralized federated learning. In: Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, 2020, 217–221.
  • Lyu L, Yu J, Nandakumar K, et al. Towards fair and privacy-preserving federated deep models. IEEE Trans Parallel Distrib Syst 2020; 31: 2524–2541.
  • Yapp AZH, et al. Communication-efficient and scalable decentralized federated edge learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, 2021.
  • Zhu L, Liu Z, Han S. Deep leakage from gradients. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, 2019.
  • Koloskova A, Stich SU, Jaggi M. Decentralized stochastic optimization and gossip algorithms with compressed communication. ArXiv: 1902.00340.
  • Li X, Huang K, Yang W, et al. On the convergence of FedAvg on non-IID data. ArXiv: 1907.02189.
  • Olfati-Saber R, Murray RM. Consensus problems in networks of agents with switching topology and time-delays. IEEE Trans Automat Contr 2004; 49: 1520–1533.
  • Ren W, Beard RW. Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Trans Automat Contr 2005; 50: 655–661.
  • Jadbabaie A, Lin J, Morse AS. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans Automat Contr 2003; 48: 988–1001.
  • Isermann R. Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. Heidelberg: Springer Science & Business Media, 2005.
  • Yuan Y, Ma G, Cheng C, et al. A general end-to-end diagnosis framework for manufacturing systems. Natl Sci Rev 2020; 7: 418–429.
  • Yang T, Yi X, Wu J, et al. A survey of distributed optimization. Annu Rev Control 2019; 47: 278–305.
  • DeMillo RA. Foundations of Secure Computation. Technical Report. Georgia Institute of Technology, 1978.
  • Yao AC. Protocols for secure computations. In: Proceedings of the 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), IEEE, Washington DC, 1982, 160–164.
  • Yu Y, Xie X. Privacy-preserving computation in the post-quantum era. Natl Sci Rev 2021; 8: nwab115.
  • Lyu L, Yu H, Yang Q. Threats to federated learning: A survey. ArXiv: 2003.02133.
