An overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that could be catastrophic.
On March 18, 2018, the first pedestrian fatality due to the operation of an autonomous vehicle occurred in Tempe, Arizona. Since then, almost 10,000 articles have been published on this accident, most of them espousing an opinion on what it all means for the future of Uber, autonomous vehicles, public-road AV testing, and even society at large.
What is missing from this cauldron of debate are the lessons that designers of autonomous sensor, software, and platform technologies can extract from this tragic event. Learning from it will be pivotal to the financial success of autonomous vehicles.
A fundamental challenge in learning from the Tempe fatality and in determining the value of ISO 26262 (the functional safety standard for road vehicles) lies in identifying the complementary and contradictory roles of reliability and safety. This is not a matter of semantics: every manager realizes that process, authority, and responsibility are the core of every software and hardware design cycle. Who does what, who reports to whom, and when they do it can result in dramatically different outcomes.
What is reliability, what is safety, and how should they relate to each other in a corporate environment? From the perspective of reliability engineers, safety is a subset of reliability. Why? While reliability focuses on the probability that a failure will occur, safety focuses on the probability that a failure will occur and result in a catastrophic event (loss, injury, or death).
Catastrophic events are just a small portion of the overall outlook being managed and tracked by the reliability team. Thus, in a reliability-centric world, safety engineers are managed by the reliability team and do not act until a thorough design-for-reliability (DfR) activity is complete.
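The "subset" argument can be put in numbers. The following is a minimal sketch, not a method taken from any standard: it assumes the safety-relevant probability can be written as P(failure) multiplied by P(catastrophic outcome given failure). The function name and both input values are hypothetical and chosen only for illustration.

```python
def catastrophic_failure_probability(p_failure: float,
                                     p_catastrophic_given_failure: float) -> float:
    """Return P(failure AND catastrophic outcome).

    Uses P(A and B) = P(A) * P(B | A), where A is "a failure occurs"
    and B is "the outcome is catastrophic".
    """
    for p in (p_failure, p_catastrophic_given_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_failure * p_catastrophic_given_failure


# Hypothetical inputs, for illustration only.
p_fail = 0.02            # reliability view: probability that the part fails
p_cat_given_fail = 0.05  # share of those failures that end catastrophically

p_cat = catastrophic_failure_probability(p_fail, p_cat_given_fail)
print(f"P(failure)                  = {p_fail:.4f}")
print(f"P(failure and catastrophic) = {p_cat:.4f}")
```

Because the conditional term can never exceed 1, the catastrophic probability can never exceed the failure probability, which is the sense in which reliability engineers treat safety as a subset of their work.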
Reliability and Safety interact
As one would expect, safety engineers do not share the same vision. From their viewpoint, reliability analyses only provide the probability of failure for a particular failure mechanism (reliability physics) or part (empirical approach). Reliability analyses have no context as to the consequence of failure: will it be catastrophic? Such analyses are therefore most effective when performed at the lowest level of the system. Because consequences are only clear at the system level, where the response of the system or the user to the failure can be considered, reliability engineers should report into the safety team.
The key function of reliability engineers is to calculate failure rates and identify basic failure modes. And since, sometimes, these failure rates are only numbers, why have a reliability engineer at all?
A third viewpoint is that reliability and safety are not as related as one would expect. A prime example of this philosophy is how the two disciplines would address fan performance. From a reliability perspective, the actions might be to ensure the fan meets failure rate goals for the expected environment, through reliability physics analysis (RPA), derating, or accelerated life testing (ALT). From a safety perspective, the actions might be to determine whether fan failure would induce a catastrophic event (based on how the fan interacts with the rest of the system) and then introduce potential mitigations, such as redundancy or prognostics using drift or change in key parameters (current draw, tachometer, noise).
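The two perspectives on the fan can be contrasted with a small calculation. This is a rough sketch under stated assumptions: a constant failure rate (exponential life model), fully independent fans with no common-cause failures, and a failure rate and mission time that are hypothetical values rather than data from any real fan.

```python
import math


def failure_probability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability that one unit has failed by `hours`, assuming a constant failure rate."""
    return 1.0 - math.exp(-failure_rate_per_hour * hours)


def redundant_failure_probability(p_single: float, n_units: int) -> float:
    """Probability that all n units have failed, assuming independent, identical units."""
    return p_single ** n_units


# Hypothetical values, for illustration only.
FAN_FAILURE_RATE = 2e-6  # failures per hour
MISSION_HOURS = 8000     # operating hours over the period of interest

# Reliability view: does a single fan meet its failure-rate goal?
p_one = failure_probability(FAN_FAILURE_RATE, MISSION_HOURS)

# Safety view: how likely is total loss of cooling with a redundant pair?
p_two = redundant_failure_probability(p_one, 2)

print(f"single fan failure probability:  {p_one:.6f}")
print(f"both redundant fans have failed: {p_two:.6f}")
```

The independence assumption is the weak point of this sketch. A shared power rail or a shared thermal environment would make the real benefit of redundancy smaller than the squared term suggests, and identifying that kind of dependence is where a physics-based reliability analysis informs the safety case.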
These different viewpoints highlight the uncertainty among technology companies on how to handle reliability and safety. One major consumer technology company that is transitioning to autonomous vehicles has Reliability and Safety reporting into the same Director. A second company, a leader in the autonomous field, has Safety and Reliability reporting into two different organizations, even though the leaders in both departments have roughly equivalent titles. A third company, a mainstay in automotive electronics that is aggressively targeting autonomous control units, also has Safety and Reliability in two different organizations, but clearly has a favorite through the numerous executive titles assigned to Safety (while the highest reliability staffer is either Manager or Leader).
Without a clear and consistent construct in how reliability and safety interact and build upon each other, the automotive industry is creating avoidable conflict and potential miscommunication that will either put customers at unnecessary risk, create autonomous systems that are excessively expensive, or both. One autonomous vehicle manufacturer had such uncertain confidence in reliability, or such unlimited authority of the safety team, that it introduced redundancy throughout the vehicle (including sensing, control, power, braking, etc.). Given that the average car has, by some estimates, over $12,000 of electronics, this introduces significant costs without necessarily making the occupants, or the traffic around them, that much safer.
A perfect example of this issue is the divergence between safety and reliability in how to calculate failure rates. From the 1950s through the 1990s, most reliability practitioners in electronic hardware organizations used empirical handbooks to calculate failure rates. These handbooks were simply aggregations of field failure data, sorted by part technology (resistor, capacitor, diode, etc.). While simple in concept and execution, repeated studies demonstrated that these handbooks were wildly inaccurate when used on actual products, with the error leaning toward the conservative side, that is, over-predicting failure rate.
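To show what a handbook-style calculation looks like in practice, here is a simplified sketch of a parts-count prediction: each part type is assigned a generic failure rate, and the board-level rate is the quantity-weighted sum. The class name, the bill of materials, and every failure rate below are hypothetical and are not taken from any actual handbook. Rates are expressed in FIT (failures per billion device-hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartLine:
    part_type: str
    quantity: int
    fit_per_part: float  # generic rate in FIT; hypothetical value


def parts_count_fit(bill_of_materials: list[PartLine]) -> float:
    """Sum quantity * generic failure rate over every line of the bill of materials."""
    return sum(line.quantity * line.fit_per_part for line in bill_of_materials)


def fit_to_mtbf_hours(total_fit: float) -> float:
    """Convert a constant failure rate in FIT to mean time between failures in hours."""
    return 1e9 / total_fit


# Hypothetical bill of materials, for illustration only.
bom = [
    PartLine("resistor", 120, 0.5),
    PartLine("capacitor", 80, 2.0),
    PartLine("diode", 10, 3.0),
]

total_fit = parts_count_fit(bom)
mtbf_hours = fit_to_mtbf_hours(total_fit)
print(f"predicted failure rate: {total_fit:.1f} FIT")
print(f"implied MTBF:           {mtbf_hours:,.0f} hours")
```

Nothing in this calculation knows how the parts are stressed, soldered, or cooled. The result depends only on part type and count, so the prediction cannot reflect the mechanisms that actually cause failure.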
The reason was straightforward: these handbooks were not based on the actual mechanisms that cause failure. Fast forward to the 21st century, and most skilled reliability practitioners no longer rely exclusively on empirical field data to predict failure rates. Reliability physics analysis (RPA) and accelerated life testing (ALT) replaced these outmoded approaches, and nowhere was this truer than in the automotive industry. Until ISO 26262 came along.
Avoiding the disconnect
As a functional safety standard, ISO 26262 requires the computation of failure rates and the appropriate mitigations to predict the automotive safety integrity level (ASIL). And the safety community, unlike reliability engineers, strongly encourages or even requires empirical prediction handbooks to be the basis of ASIL calculations. This disconnect is driven by the lack of a universal construct between reliability and safety. Creating separate organizations reporting into separate management has led to a breakdown in communication, causing safety engineers to use outmoded approaches for failure rate calculations.
In addition, without a balance between the two groups, safety teams will tend to prefer higher failure rates, which require additional safety analyses and safety mitigations, including redundancy. Safety’s focus on simple handbook calculations will also result in overlooking critical failure modes, such that safety mitigations are no longer effective.
There is still an opportunity for improvement. Players in autonomous technology, from semiconductors to electronic modules to overall systems, must realize that an overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that will be difficult to untangle.
A good first step is to make sure that reliability and safety are within the same organization, reporting to a neutral observer. Both sides should agree to implement best practices, including the use of state-of-the-art simulation, modeling, and reliability physics, to lay the groundwork for appropriate and effective risk identification and mitigation.
Author: Craig Hillman
Source: SAE Automotive Vehicle Engineering Magazine