CEB NT-97/07
31 October 97

 
 
 
 
 
 
 
 

Reliability estimation of the controller CARD (SMC) in the TOF system for the AMS experiment
 
 

I.D'Antone, M.Lolli, G.Sola, G.Torromeo
 
 

Istituto Nazionale di Fisica Nucleare

Bologna, Italy



 
 
 
 
 
 
 
 

Abstract



In this report we estimate the reliability of the control board (SMC) used in the TOF system of the AMS experiment and describe some considerations about hard and soft failures. We mainly discuss the board architecture and the reliability features obtained.

During the failure rate evaluation we have always chosen the worst case, so as to take a conservative approach.

The estimated short-term reliability of the SMC is very good; for a long-life mission, however, some redundancy must be introduced to increase the hardware reliability.
 
 


CENTRO DI ELETTRONICA

ISTITUTO NAZIONALE DI FISICA NUCLEARE

Sezione di Bologna





1. Introduction
 
 

The AMS is an experiment designed to measure the amount of antimatter nuclei present in cosmic rays. Antimatter is identified from the absolute charge of a particle crossing the spectrometer and from the sign of the charge, which is determined by the curvature of the trajectory and by the flight direction given by the TOF and Cerenkov counters [1].

The High Voltage Distribution Card (HVDC) designed for the Time Of Flight (TOF) detector provides individually programmable supply voltages to the 21 PMTs covering ¼ of a TOF detector plane.

The HVDC is split into two interconnected parts, but it works as a single unit under the supervision of the AMS Slow Control System via a Scintillator Monitor Card (SMC).

Through the HVDC, the SMC controls the value of the primary high voltage in the range 800 to 1100 V. The SMC also supplies the activation pattern and the amplitude to a programmable LED pulser section in the HVDC, which injects test light pulses into the plastic scintillators.

The typical crate of the TOF system is composed of 6 SFEX cards (SFEX = SFET, SFEA, SFEC), which perform the time-to-digital conversion, 1 SMC and a CPU (JDQS) [2].
 
 
 
 

2. SMC block diagram description
 
 

The SMC is the only module in the crate controlled by the CPU. To perform a read/write operation in the SMC or in the SFEX cards, the JDQS must address the SMC.

In this way the JDQS, through the SMC, can set the 64 Vref for the TOF HV (by means of the HVDC), ANTI HV and CEREN HV, and the 16 Vref in each SFET; it can also read out the two temperature sensors in each SFET.

In other words, during a write cycle the SMC acts as a buffer, while during a read cycle it switches the 8-bit data bus to connect the SFEX cards to the JDQS CPU.
 
 

During a write cycle the JDQS first sends the address that identifies one of the six SFEX boards or the SMC. In a second phase it sends the data, which arrive at the board identified in the first phase.

During a read cycle the JDQS first writes the address of the SFEX to be read (the SMC cannot be read). From this address the SMC identifies which bus, "up" or "down", to connect to the JDQS bus. In a second phase the JDQS reads the data from this bus.
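The two-phase cycles described above can be sketched as a small toy model. This is only an illustration: the address map, the split of three SFEX cards per bus, the single register per board and all the names (`SMCModel`, `address_phase`, `write_data`, `read_data`) are assumptions, since the report does not give these details.

```python
class SMCModel:
    """Toy model of the two-phase bus cycles; not the real hardware."""

    # Hypothetical address map: the real addresses are not given in this report.
    SMC_ADDR = 0
    UP_SFEX = (1, 2, 3)    # assumed: SFEX cards on the "up" bus
    DOWN_SFEX = (4, 5, 6)  # assumed: SFEX cards on the "down" bus

    def __init__(self):
        self.selected = None  # board latched by the last address phase
        boards = (self.SMC_ADDR,) + self.UP_SFEX + self.DOWN_SFEX
        # One 8-bit register per board is a simplification for illustration.
        self.registers = {addr: 0 for addr in boards}

    def address_phase(self, addr):
        """First phase of both cycles: the JDQS sends the board address."""
        if addr not in self.registers:
            raise ValueError(f"unknown board address {addr}")
        self.selected = addr

    def write_data(self, byte):
        """Second phase of a write cycle: the SMC buffers the 8-bit data."""
        if self.selected is None:
            raise RuntimeError("address phase must come first")
        self.registers[self.selected] = byte & 0xFF

    def read_data(self):
        """Second phase of a read cycle: switch the 'up' or 'down' bus to the JDQS."""
        if self.selected is None:
            raise RuntimeError("address phase must come first")
        if self.selected == self.SMC_ADDR:
            raise PermissionError("the SMC cannot be read")
        bus = "up" if self.selected in self.UP_SFEX else "down"
        return bus, self.registers[self.selected]
```

In this sketch a read of board 5 returns the pair `("down", value)`, while an attempt to read the SMC itself raises an error, mirroring the rule that the SMC cannot be read.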
 
 

A block scheme of the SMC is shown in fig.1. The SMC card is divided into two parts: one half of the SMC handles the SFEX cards in the top of the crate, while the other half handles the SFEX cards in the bottom of the crate.
 
 

The number of Vrefs generated by the SMC for the HV system has been set to 64. Since there are 336 TOF PMTs and 8 SMC cards (one for each crate), each SMC must generate 336/8 = 42 signals for the Vref of the HV power supplies. In addition it must generate 2 levels for the main HV voltages and 2 levels for the LED pulses, as well as other signals for the ANTI HV and CEREN HV systems. Since each DAC chip contains eight channels, we have decided to generate 64 analog signals per SMC to optimize the use of the DACs and buffers. These analog signals lie in the 0-5 V range.
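The channel budget above can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts quoted in the text; the variable names are illustrative, and the number of ANTI HV and CEREN HV signals is not stated, so it appears only as the remaining headroom.

```python
TOF_PMTS = 336         # photomultipliers in the TOF system (from the text)
SMC_CARDS = 8          # one SMC per crate (from the text)
DAC_CHANNELS = 8       # channels per DAC chip (from the text)
SIGNALS_PER_SMC = 64   # analog signals chosen per SMC (from the text)

pmt_vrefs = TOF_PMTS // SMC_CARDS              # 42 Vref signals for the PMT supplies
tof_signals = pmt_vrefs + 2 + 2                # plus 2 main HV levels and 2 LED levels
dac_chips = SIGNALS_PER_SMC // DAC_CHANNELS    # DAC chips needed for 64 signals
headroom = SIGNALS_PER_SMC - tof_signals       # left for ANTI HV and CEREN HV

print(pmt_vrefs, tof_signals, dac_chips, headroom)  # 42 46 8 18
```

With 64 signals per card, eight fully used DAC chips cover the 46 TOF-related signals and leave 18 channels for the ANTI HV and CEREN HV systems.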
 
 
 
 
 
 
 
 
 
 
 
 
 
 


 
 




fig.1
 
 

SMC block scheme


 
 




3. General considerations about the SMC reliability.
 
 

To increase the reliability of the board we have considered several techniques: component selection, derating, burn-in, redundancy.

Field Programmable Gate Arrays (FPGAs) are attractive for space applications because of their good density and low cost. FPGA devices using anti-fuse technology (Cypress-QuickLogic, Actel) have become popular with space flight designers. We have used Cypress-QuickLogic FPGAs. They are programmed by "burning" one or more anti-fuses, which otherwise have a very high electrical resistance and keep electrical nodes separate. Programming creates a low-resistance path between the nodes.

From the reliability report of the Cypress company [3] the overall reliability of the pASIC380, which we have used in the SMC board, is 19 FITs with a 60% confidence level.
 
 
 
 

The Failure In Time (FIT) is a measure of the failure rate: 1 FIT = 1 failure in $10^9$ device hours. The confidence level is the probability level at which the population failure rate estimates are derived from the sample life test.
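As a small illustration of the FIT unit, the sketch below converts a FIT value into failures per hour and into a mean time to failure. The helper names are hypothetical, and the MTTF conversion assumes a constant failure rate (the exponential failure law).

```python
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device hours


def fit_to_rate(fits):
    """Convert a failure rate in FITs to failures per hour."""
    return fits / FIT_HOURS


def fit_to_mttf_hours(fits):
    """MTTF in hours, assuming a constant failure rate (MTTF = 1/lambda)."""
    return FIT_HOURS / fits


# The pASIC380 value quoted above: 19 FITs at a 60% confidence level.
pasic_rate = fit_to_rate(19)
pasic_mttf = fit_to_mttf_hours(19)
print(f"{pasic_rate:.2e} failures/hour, MTTF about {pasic_mttf:.2e} hours")
```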

Furthermore, for the other components of the SMC board we have selected devices from manufacturers that have a documented reliability program, and devices with ultra-low power consumption.

We have used DACs from Analog Devices and micro-power Operational Amplifiers from National Semiconductor. All components employed are SMT (Surface Mount Technology) devices that offer a significant advantage in terms of reduced size and weight.

In the following we give a reliability estimation for the SMC board and discuss the system architecture.

Using reliability block diagrams and the exponential failure law, we evaluate the reliability of the controller.

We first investigate the system architecture and then examine the component reliability.
 
 
 
 

4. Reliability of a partitioned board


If a controller is partitioned into several independent parts (four blocks in fig.2), then when one part fails the controller still controls a fraction of the apparatus (75%, i.e. 3/4, in the case shown in fig.2). If the failure condition is not disastrous, most of the system can continue to work for a long time. In other words, if the system can be partitioned into several parts and a failure in one part can be tolerated, we have a reduced system with a higher MTTF, where the MTTF is the expected time that a system will operate before the first failure occurs.

fig.2

partitioning in a controller
 
 
 
 

In general, if there are N identical modules and M of those are required for the system to function properly, the system can tolerate N-M failures. The reliability of an M-of-N system can be written
 
 
 
 

$$R_{M\text{-of-}N}(t) = \sum_{i=0}^{N-M} C(N,i)\, R^{N-i}(t)\, (1-R(t))^{i}$$

where $C(N,i) = N! / ((N-i)!\, i!)$.

Fig.3 shows the reliability of the controller of fig.2 when the failure rate of each part is $\lambda = 100 \times 10^{-6}$ failures per hour. The exponential form of the reliability function, $R(t) = e^{-\lambda t}$, has been used.
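The M-of-N reliability can be evaluated numerically as in the following sketch. The function names are illustrative; the failure rate and the four-part controller come from the text, while the choice of M = 4 and M = 3 and the sample times are assumptions made only for the example.

```python
from math import comb, exp


def reliability(rate_per_hour, t_hours):
    """Exponential failure law R(t) = exp(-lambda * t)."""
    return exp(-rate_per_hour * t_hours)


def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent parts work.

    Sums over i = 0 .. n-m failed parts, each term being
    C(n, i) * r^(n-i) * (1-r)^i.
    """
    return sum(comb(n, i) * r ** (n - i) * (1 - r) ** i
               for i in range(n - m + 1))


LAMBDA = 100e-6  # failures per hour for each part, as in the text

for t in (0, 1000, 5000, 10000):  # sample times in hours (illustrative)
    r = reliability(LAMBDA, t)
    all_four = m_of_n_reliability(4, 4, r)     # every part must work
    three_of_4 = m_of_n_reliability(3, 4, r)   # one failed part is tolerated
    print(f"t={t:6d} h  4-of-4={all_four:.4f}  3-of-4={three_of_4:.4f}")
```

Tolerating one failed part always gives a reliability at least as high as requiring all four, which is the argument for partitioning.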
 
 

fig.3
 
 

reliability in a partitioned controller


 
 


Following these criteria we have partitioned the SMC into two independent parts, each controlling 32 channels. Overall, one SMC board controls 64 HV channels.
 
 
 
 

5. Reliability estimation for a partitioned SMC.
 
 

Fig.4 shows the connection graph of the SMC components concerning the HV control. The SMC also contains the SFEX interface, connected to the CPU interface through the pASIC, but the figure shows only the part involved in the HV control.
 
 
 
 
 
 


 
 




fig.4
 
 

SMC functional blocks


 
 


Each block in fig.4 represents all the components needed for that part to work properly. For example, the pASIC block contains one pASIC, the resistors, the capacitors, one buffer and the transistor for the pASIC protection. The failure rate $\lambda$ predicted for each part is shown in the following table:
 
 
 
Block                  Failure rate
CPU interface block    $\lambda_{CPU}$ = 96 FITs
pASIC block            $\lambda_{pASIC}$ = 173 FITs
DAC block              $\lambda_{DAC}$ = 104 FITs
AMP block              $\lambda_{AMP}$ = 74 FITs


 
 
 

To estimate the failure rate of each specific component we have used the data given by the manufacturer in its reliability reports or the failure rate of similar components; when these data were not available, we used the values indicated by the MIL-HDBK-217 standard of the United States Department of Defense.

The MIL-HDBK-217 model predicts, for example, the constant failure rate of an integrated circuit (IC) as
 
 

$$\lambda = \pi_L\, \pi_Q\, (C_1 \pi_T + C_2 \pi_E)\, \pi_P \quad \text{failures per } 10^6 \text{ hours}$$

where:

$\pi_L$ is a learning factor (maturity of the fabrication process),

$\pi_Q$ is a quality factor (screening performed by the manufacturer),

$\pi_T$ is the temperature factor (junction operating temperature),

$\pi_E$ is an environmental factor (function of the harshness of the environment),

$\pi_P$ is a pin factor (function of the number of pins),

$C_1$ and $C_2$ are complexity factors (function of the number of gates for logic circuits or the number of transistors for linear circuits).
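The structure of the IC model above can be written as a short function. This is a sketch only: the function names are hypothetical, and the numeric factor values in the example are placeholders chosen for easy arithmetic, not values taken from MIL-HDBK-217 or used for the SMC.

```python
def ic_failure_rate(pi_l, pi_q, pi_t, pi_e, pi_p, c1, c2):
    """IC failure rate in failures per 10^6 hours.

    Implements lambda = pi_L * pi_Q * (C1 * pi_T + C2 * pi_E) * pi_P.
    """
    return pi_l * pi_q * (c1 * pi_t + c2 * pi_e) * pi_p


def per_million_hours_to_fits(rate):
    """Convert failures per 10^6 hours to FITs (failures per 10^9 hours)."""
    return rate * 1000.0


# Placeholder factors (NOT handbook values), for illustration only.
rate = ic_failure_rate(pi_l=1.0, pi_q=1.0, pi_t=0.5, pi_e=0.5, pi_p=1.0,
                       c1=0.02, c2=0.01)
print(rate, "failures per 10^6 hours =", per_million_hours_to_fits(rate), "FITs")
```

The conversion to FITs makes the result directly comparable with manufacturer data quoted in FITs.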
 
 

The MIL-HDBK-217 is generated through the compilation and analysis of large amounts of empirical reliability data on all types of electronic components. It is not the only reliability handbook. Due to the inherent problems with empirical reliability prediction, MIL-HDBK-217 must be used with caution. Like all reliability prediction tools, it serves better as a reliability comparison tool than an absolute reliability measure.

We use the reliability prediction to analyse the partitioning architecture of the controller card.

The MIL-HDBK-217 handbook also gives a formula to evaluate the failure rates of surface mount technology (SMT) interconnection assemblies; in our case, however, we have included a factor in each part to account for connector failures, solder joint failures and the like.
 
 

The reliability block diagram for the controller is shown in fig.5.
 
 
 
 
 
 

fig.5
 
 

reliability block diagram


 
 

All blocks are in series, to show that all the elements are required for the controller to function correctly and to handle 64 HV channels. For each component we use the exponential failure law $R(t) = e^{-\lambda t}$.

The reliability of the controller is:

$$R_{64ch}(t) = e^{-\lambda_{64ch} t}$$

where $\lambda_{64ch} = \lambda_{CPU} + 2 \lambda_{pASIC} + 8 \lambda_{DAC} + 64 \lambda_{AMP}$.
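The series failure rate can be computed directly from the table of block failure rates. The following is a minimal sketch; the dictionary layout and function names are illustrative, and the mission time passed to `reliability` is a free parameter.

```python
from math import exp

# Block failure rates in FITs, from the table above.
BLOCK_FITS = {"CPU": 96, "pASIC": 173, "DAC": 104, "AMP": 74}

# Number of blocks of each kind needed to handle all 64 channels.
COUNTS_64CH = {"CPU": 1, "pASIC": 2, "DAC": 8, "AMP": 64}


def series_fits(counts, block_fits=BLOCK_FITS):
    """Failure rate of blocks in series: the rates simply add."""
    return sum(n * block_fits[name] for name, n in counts.items())


def reliability(total_fits, t_hours):
    """Exponential failure law with the rate given in FITs."""
    return exp(-total_fits * 1e-9 * t_hours)


lambda_64ch = series_fits(COUNTS_64CH)
print(lambda_64ch, "FITs")  # 6010 FITs
```

The 64 amplifier blocks dominate the sum (4736 of the 6010 FITs), which is why halving the number of channels required to work has such a strong effect.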
 
 

To compare this reliability with the reliability of the controller handling 32 channels, the reliability block diagram must be modified as shown in fig.6.
 
 

The dotted line indicates 32 working channels without any reference to the half board. We have two parts in parallel after the CPU interface. The reliability of the controller is:

$$R_{32ch}(t) = R_{CPU}\, [1 - (1 - R_s)^2]$$

where $[1 - (1 - R_s)^2]$ is the reliability of two blocks in parallel.

fig.6
 
 

reliability block diagram


 
 




Each of the two parallel blocks has the following reliability:

$$R_s = e^{-\lambda_s t}, \quad \lambda_s = \lambda_{pASIC} + 4 \lambda_{DAC} + 32 \lambda_{AMP}$$
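The 32-channel reliability (CPU interface in series with two half boards in parallel) can be evaluated as in this sketch. The names are illustrative, and the year length of 8760 hours used in the example call is an assumption, since the report does not state the value used for its one-year figures.

```python
from math import exp

FIT = 1e-9  # failures per hour corresponding to 1 FIT

# Block failure rates in FITs, from the table of predicted failure rates.
L_CPU, L_PASIC, L_DAC, L_AMP = 96, 173, 104, 74

# One half board: 1 pASIC, 4 DACs and 32 amplifiers in series.
lambda_s = L_PASIC + 4 * L_DAC + 32 * L_AMP   # 2957 FITs


def r_exp(fits, t_hours):
    """Exponential failure law with the rate given in FITs."""
    return exp(-fits * FIT * t_hours)


def r_32ch(t_hours):
    """CPU interface in series with two half boards in parallel."""
    r_s = r_exp(lambda_s, t_hours)
    return r_exp(L_CPU, t_hours) * (1 - (1 - r_s) ** 2)


HOURS_PER_YEAR = 8760  # assumption: 365 days of continuous operation
print(f"R_32ch after one year: {r_32ch(HOURS_PER_YEAR):.4f}")
```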
 
 

fig.7
 
 

SMC reliability







In fig.7 we have plotted the two reliabilities $R_{64ch}(t)$ and $R_{32ch}(t)$. If no maintenance is performed on the system during one year of continuous operation, the probability that all 64 channels are working properly is 0.956, while the probability that 32 channels are working is 0.998.

Stated differently, the controller (64 channels) has a probability of 1 - 0.956 = 0.044 of failing within a year of continuous operation.

In addition to the failure rate, the mean time to failure (MTTF) is a useful parameter to specify the quality of a system. It is equal to the area under the reliability curve; the MTTF for $R_{64ch}(t)$ evaluates to $1.66 \times 10^5$ hours.
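For the exponential failure law the area under the reliability curve has a closed form. The short derivation below uses the block failure rates from the table (96, 173, 104 and 74 FITs) and the series sum for 64 channels; it is a worked check of the quoted value rather than a new result.

```latex
\begin{align}
\mathrm{MTTF} &= \int_0^\infty R(t)\,dt
               = \int_0^\infty e^{-\lambda t}\,dt
               = \frac{1}{\lambda} \\
\lambda_{64ch} &= 96 + 2 \cdot 173 + 8 \cdot 104 + 64 \cdot 74
                = 6010\ \text{FITs}
                = 6.01 \times 10^{-6}\ \text{h}^{-1} \\
\mathrm{MTTF}_{64ch} &= \frac{1}{6.01 \times 10^{-6}\ \text{h}^{-1}}
                      \approx 1.66 \times 10^{5}\ \text{h}
\end{align}
```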
 
 

The failure rate shown above has been estimated for hard faults. There are, however, also soft faults, due to the occurrence of transient errors. Transients, by definition, are ephemeral, and troubleshooting can be difficult. Among the causes are electromagnetic interference, power supply instability, marginal parametric characteristics at the device level that result in data ambiguity, metastability, single event upsets (SEU) caused by the decay of radioactive materials in the device packaging or by environmental (cosmic ray) radiation, and protocol errors that create data conflicts.

The occurrence of transient errors in a system has been estimated at more than 6 times the hard error rate. If we suppose $\lambda_{soft} = 10 \lambda_{hard}$, so that the total failure rate is $11 \lambda_{hard}$, we obtain (see fig.7) an MTTF of $1.512 \times 10^4$ hours = 1.75 years. For a long-life mission, therefore, some redundancy must be introduced to increase the hardware reliability.
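The effect of soft faults on the MTTF can be reproduced with a few lines. This sketch assumes that hard and soft failure rates simply add, which is an interpretation (it reproduces the quoted figure in hours) rather than something stated explicitly; the names are illustrative.

```python
# Hard failure rate of the 64-channel controller in FITs:
# 1 CPU interface, 2 pASIC, 8 DAC and 64 AMP blocks in series.
HARD_FITS = 96 + 2 * 173 + 8 * 104 + 64 * 74   # 6010 FITs

SOFT_TO_HARD = 10  # supposed ratio lambda_soft / lambda_hard, from the text


def mttf_hours(total_fits):
    """MTTF = 1/lambda for a constant failure rate given in FITs."""
    return 1e9 / total_fits


# Assumption: the total rate is the sum of the hard and soft rates.
total_fits = HARD_FITS * (1 + SOFT_TO_HARD)

mttf_hard = mttf_hours(HARD_FITS)
mttf_total = mttf_hours(total_fits)
print(f"hard faults only: {mttf_hard:.3e} h, hard + soft: {mttf_total:.3e} h")
```

Under this assumption the MTTF drops by a factor of 11, from about 1.66e5 hours to about 1.51e4 hours.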
 
 
 
 

Conclusion
 
 

In this report we have estimated the reliability of the control board used in the TOF system and described some considerations about hard and soft failures. We have mainly discussed the board architecture and the reliability features obtained.

During the failure rate evaluation we have always chosen the worst case, so as to take a conservative approach.

Some simplifications have been introduced in our discussion; nevertheless the work has been useful to identify the best architecture for the controller.

The estimated short-term reliability of the SMC is very good; for a long-life mission, however, some further redundancy must be introduced to increase the hardware reliability.
 
 
 
 

References
 
 

[1] AMS collaboration, "Ricerca di antimateria nell'universo", Programma scientifico INFN, Gruppo II, 1996.
 
 

[2] I. D'Antone, G. Torromeo, "Scintillator Monitor Card (SMC) for the TOF system of the AMS experiment", internal note, CEB NT-97/03, 1997.
 
 

[3] Cypress, "Programmable Logic DATA BOOK", 1996.
 
 

[4] I. D'Antone, "Reliability and redundancy in complex digital system: an overview", internal note, CEB NT-96/03, 1996.
 
 

