
Survey of Cooling and Energy Efficiency Techniques Used in High Performance Computing


Student Name:
University Name:








Abstract

Energy consumption in high performance computing and cloud services is enormous, which calls for energy efficiency techniques that reduce the power used to operate and cool data servers and high performance computers. To find more energy efficient techniques, companies across the world share data and conduct surveys. This article gives an overview of the need for energy efficiency, energy efficiency algorithms for supercomputers, and the cooling and energy efficiency techniques used across the globe.

Introduction

There are multiple sources of energy consumption in high performance computing: the power drawn by High Performance Computing (HPC) systems themselves, the operation of data servers, the cooling of HPC systems and servers, and heat losses in the overall process. According to Moore's law for power consumption, "the power consumption of each computer node doubles every 18 months." The Top500 supercomputer list shows that the supercomputer with the highest power consumption runs at 6.95 MW (Borah, Muchahary, & Singh, 2015; Liu & Zhu, 2010). Cloud computing, supercomputers, and computer networks have completely changed how information is sent and received, but keeping the whole infrastructure running 24/7 requires a lot of power. Companies across the globe such as Google, Facebook, Amazon, Microsoft, and YouTube are adding new energy efficient technologies and techniques to save power, and they are also moving towards renewable energy to reduce their global warming impact. To serve billions of customers, supercomputers, data centers, and the related infrastructure are growing at a rapid pace (Allard et al., 2010; Koomey, 2007, 2008).

Energy Efficiency and Challenges

Energy efficiency means a reduction in the energy consumed for a particular set of operations. These days, all electrical equipment comes with a star rating that indicates its energy efficiency. However, it is quite difficult to assign an energy rating to computer and network equipment, because multiple systems and sub-systems are connected together.
Some of the challenges companies face in high performance computing are as follows:
  1. Handling large amounts of data for research and financial transactions requires supercomputers with very high CPU throughput, in the range of TFLOPS, which consume power in the kW to MW range.
  2. The large volume of data generated by supercomputers, research, videos, and photos must always be available on the server so that customers can access it from anywhere. This requires data centers to run 24/7.
  3. Servers and high speed computers run at only 10-50% of their capacity most of the time. Extra resources are pooled in just to meet occasional peaks in demand and to satisfy service agreements, which leaves servers and supercomputers with low energy efficiency (Manasse, McGeoch, & Sleator, 1990; Speitkamp & Bichler, 2010).
  4. Restrictions on compaction: the high ratio of cooling demand to power consumption limits how densely servers can be packed and poses a great challenge in the development of superfast computers. The hard disk failure rate also rises sharply if the temperature climbs 15 degrees Celsius above normal operating temperature (Duchessi & Chengalur-Smith, 1998).
  5. Huge electricity bills for operating and cooling the computer hardware, data centers, networks, and servers.
  6. The large cooling cost is an apparent hindrance to the development of high speed computing: on average, about 0.5 watt of electrical energy is consumed for cooling for every watt used by the server. This puts immense pressure on the development of new infrastructure as companies try to become more energy efficient to reduce their electricity bills and carbon footprint.
So, energy efficiency and carbon footprint reduction are the main goals in high performance computing, data centers, networks, and servers. Figure 1 (below) shows how energy is used in such a system.

Energy conservation strategies

Renewable energy use: New data centers of Facebook, Google, and Amazon use renewable energy sources such as solar and wind power to feed the data centers and supercomputers.
Nano data centers: Nano data centers are a possible alternative to today's typical data centers for saving energy. A large number of small, interconnected, geographically distributed data centers can provide the same computing and storage power through a peer-to-peer model, and can save 20-30% of the energy used by traditional data centers.
Energy efficient storage: Data storage should move from hard disk drives to solid state storage, which has no moving parts and therefore consumes less energy in cooling. The lifetime of existing storage in data centers is 9-10 years before the data has to be moved from one hard disk to another, so it is important that new storage options are adopted.
Live migration of virtual machines (VMs): This method moves a running VM from one host to another. It helps the data center with load balancing, power management, and IT maintenance; a minimal consolidation sketch based on this idea follows.
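To make the consolidation idea concrete, the sketch below packs a set of VMs onto as few hosts as possible using a simple first-fit decreasing heuristic, so that the emptied hosts can be powered down. This is only an illustrative sketch: the host capacity, VM names, and load figures are assumed values, not data from the survey.

```python
# Hedged sketch of VM consolidation via live migration: pack the VMs onto
# as few hosts as possible (first-fit decreasing) so idle hosts can be
# powered down. Host capacity and VM loads are made-up numbers.

HOST_CAPACITY = 1.0                      # normalized CPU capacity per host
vm_loads = {"vm1": 0.45, "vm2": 0.30, "vm3": 0.25,
            "vm4": 0.20, "vm5": 0.15, "vm6": 0.10}

hosts = []                               # each host is a dict of vm -> load
for vm, load in sorted(vm_loads.items(), key=lambda x: x[1], reverse=True):
    for host in hosts:
        if sum(host.values()) + load <= HOST_CAPACITY:
            host[vm] = load              # migrate VM to an existing host
            break
    else:
        hosts.append({vm: load})         # no room anywhere: power on a host

print(f"hosts needed after consolidation: {len(hosts)}")
for i, host in enumerate(hosts, 1):
    print(f"  host {i}: {sorted(host)} (load {sum(host.values()):.2f})")
```

With the loads above, six VMs fit on two hosts, so the remaining hosts could be put into a low power state.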
Figure 1: Breakdown of energy use in a system. The energy input splits into energy loss (overhead of the supporting subsystems, energy not consumed by any subsystem) and energy used; the energy used splits into the energy consumed by the task itself and energy waste (idle and redundant running of the system).

Energy efficiency can be improved by reducing the energy loss and the energy waste shown in the diagram above. Energy loss can be minimized by installing more energy efficient systems with less leakage, and losses in the supporting subsystems can be reduced, for example, by implementing a single cooling unit for a whole cabinet instead of one per rack. Energy waste from idle running can be cut with better algorithms, and better algorithms also let the system achieve the same result in less time and thereby save energy.
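As a small worked example of this breakdown, the sketch below computes an overall efficiency figure from an assumed split of the energy input; the percentages are illustrative only and are not taken from the survey.

```python
# Hedged sketch of the energy breakdown in Figure 1. The split is an
# illustrative assumption, not data from the survey.

energy_input_kwh = 100.0
breakdown = {
    "subsystem overhead (loss)": 20.0,
    "not consumed by any subsystem (loss)": 5.0,
    "idle run (waste)": 15.0,
    "redundant run (waste)": 10.0,
    "task (useful work)": 50.0,
}

useful = breakdown["task (useful work)"]
print(f"overall energy efficiency: {useful / energy_input_kwh:.0%}")
for item, kwh in breakdown.items():
    print(f"  {item}: {kwh / energy_input_kwh:.0%}")
```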
Table 1. Energy consumption by data centers and supercomputers in the USA (2013)

Year                  Energy (B kWh)   Electricity bill (US, $B)   CO2 (US, million MT)
2013                  91               9.0                         97
2020 (projected)      139              13.7                        147
2013-2020 increase    47               4.7                         50

Figure 2: Energy consumption in data centers across the globe (Koomey, 2007)
This figure shows that data centers and supercomputers consume more than 1% of the world's total electricity.
Different definitions of data center power density are used to describe a single data center, taking into account space, hardware, cooling, and so on. The most commonly used metric considers the power and cooling requirement per rack. In practice, the power and cooling requirements of racks vary to a great extent: in most cases a rack draws less than 5 kW, but new racks with liquid cooling can draw up to 45 kW (Koomey, 2007, 2008).
Table 2. Specification of a 500 kW data center

Parameter                                                     Value
Total power consumed by IT equipment                          500,000 W
Total space consumed by IT equipment                          260 m2
Back-room area devoted to cooling plant, switchgear, etc.     130 m2
Total data center floor space                                 390 m2
Footprint per IT rack enclosure                               0.622 m2
Quantity of rack enclosures                                   100
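As a quick check on these figures, the sketch below derives two simple density metrics from Table 2; the resulting 5 kW per rack is consistent with the typical per-rack draw mentioned above. The metric names are my own labels, not terms from the source.

```python
# Hedged sketch deriving density metrics from Table 2. The inputs are the
# table values above; the derived metric names are my own labels.

IT_POWER_W = 500_000        # total IT equipment power
FLOOR_AREA_M2 = 390         # total data center floor space
RACKS = 100                 # quantity of rack enclosures

power_per_rack_kw = IT_POWER_W / RACKS / 1000
power_per_m2 = IT_POWER_W / FLOOR_AREA_M2
print(f"average power per rack: {power_per_rack_kw:.1f} kW")   # 5.0 kW
print(f"power per floor area:   {power_per_m2:.0f} W/m2")      # ~1282 W/m2
```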

Power management techniques

Different techniques are used to reduce power consumption, peak power draw, and cooling cost. These techniques fall into four broad categories:
  1. DVFS (dynamic voltage/frequency scaling) based techniques: The clock frequency of the processor is dynamically adjusted so that the supply voltage, and hence the power consumption, can be reduced. The power drawn can be modeled as
P = P_static + C * F * V^2,
where C is the capacitance of the transistor gates, F is the operating frequency, and V is the supply voltage. Since the frequency at which the circuit is clocked determines the voltage required for stable operation, an intelligent reduction of the frequency allows the supply voltage to be lowered and leads to power savings, although it can also reduce processor performance. A small numerical sketch of this model is given after the list.
  2. Dynamic resource sleeping (DRS): Nodes and devices are transitioned into a low power state or switched off when demand is low. Servers are also consolidated onto a smaller number of machines: a few servers run at full capacity while the others are kept in a hibernation state to save power.
  3. Workload management techniques: Work is distributed across servers by estimating the workload in advance from CPU utilization data, predicting demand, and switching off lightly used servers after shifting their load to other servers. Most of the time server load is below 10%, and only occasionally does it spike to 50% of the total installed capacity.
  4. Thermal management techniques: Increasing the power density of the racks in a data server or a supercomputer leads to high temperatures, which reduces system reliability through sudden failures and increases cooling cost. Heating can be measured using either a) direct thermal profiling or b) indirect power profiling; however, indirect power profiling does not always give an accurate reading. These techniques track the temperature of the system and ensure that the servers are cooled, and temperature aware workload placement can further reduce the cooling cost. A common approach is a proportional integral derivative (PID) controller: a sensor measures the temperature of the exhaust air after it has cooled the datacom equipment and feeds it back to the controller, which maintains a constant system temperature. In the controller, L is the threshold temperature and T is the measured temperature of the system; a toy version of this loop is sketched after the list.
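The first sketch below illustrates the DVFS power model from item 1 with assumed values for the static power, effective capacitance, and the nominal voltage/frequency operating point; it is not a measurement of any real processor.

```python
# Hedged sketch of the DVFS power model P = P_static + C * F * V^2.
# All numbers are illustrative assumptions, not measurements.

def cpu_power(static_w, capacitance_f, freq_hz, volt_v):
    """Total power: static power plus dynamic power C * F * V^2."""
    return static_w + capacitance_f * freq_hz * volt_v ** 2

P_STATIC = 15.0          # assumed static/leakage power in watts
C_EFF = 2.0e-8           # assumed effective switched capacitance in farads
F_NOMINAL = 3.0e9        # 3 GHz nominal frequency
V_NOMINAL = 1.2          # nominal supply voltage in volts

# DVFS: scale frequency down 20% and assume voltage can drop roughly in
# proportion (a simplification of real voltage/frequency tables).
F_SCALED = 0.8 * F_NOMINAL
V_SCALED = 0.8 * V_NOMINAL

p_full = cpu_power(P_STATIC, C_EFF, F_NOMINAL, V_NOMINAL)
p_dvfs = cpu_power(P_STATIC, C_EFF, F_SCALED, V_SCALED)
print(f"nominal: {p_full:.1f} W, scaled: {p_dvfs:.1f} W, "
      f"saving: {100 * (1 - p_dvfs / p_full):.1f}%")
```

The second sketch is a toy version of the PID temperature control loop from item 4: it drives the measured exhaust temperature T toward the threshold L. The plant model and controller gains are assumptions chosen only for illustration.

```python
# Hedged sketch of a PID loop holding exhaust-air temperature at a
# threshold L. The plant model and gains are illustrative assumptions.

L_SETPOINT = 27.0          # threshold temperature L in deg C

def pid_step(error, state, kp=2.0, ki=0.1, kd=0.5, dt=1.0):
    """One PID update; returns (control signal, new controller state)."""
    integral = state["integral"] + error * dt
    derivative = (error - state["previous"]) / dt
    control = kp * error + ki * integral + kd * derivative
    return control, {"integral": integral, "previous": error}

state = {"integral": 0.0, "previous": 0.0}
temperature = 35.0         # measured exhaust temperature T
for _ in range(20):
    error = temperature - L_SETPOINT
    cooling, state = pid_step(error, state)
    # Toy plant: cooling effort pulls T down, a fixed heat load pushes it up.
    temperature += 0.5 - 0.1 * cooling
print(f"final exhaust temperature: {temperature:.1f} deg C")
```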

Power Capping

Power capping is a technique that keeps power consumption below a safe threshold. It is based on a long term study by Ranganathan et al. showing that resource utilization is low, that high spikes in power consumption do not happen frequently, and that the probability of all servers spiking at the same time is very low.
A high performance computer can be divided into three sub-groups: a) front-end server nodes, b) computing nodes, and c) storage nodes. Based on survey data, the CPU consumes the largest share of power. CPU performance analysis gives a better picture of which processes are important for the system to run effectively, and appropriate CPU frequencies are assigned based on the type of job to improve performance (Beloglazov, Buyya, Lee, & Zomaya, 2011; Koomey, 2007, 2008; Mittal, 2014; Tang, Mukherjee, Gupta, & Cayton, 2006). A small power capping sketch is given below.
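The sketch below shows one way power capping could work: when the estimated cluster power exceeds a cap, the least-loaded nodes are stepped down to lower DVFS levels first so that busy nodes keep their performance. The node list, DVFS power table, and cap are invented values, and the greedy policy is an illustration, not the scheme of Ranganathan et al.

```python
# Hedged power capping sketch: step down the DVFS level of the least-loaded
# nodes until the estimated cluster power is back under the cap.
# Node data and the power-per-level table are assumptions for illustration.

POWER_PER_LEVEL = {0: 300.0, 1: 240.0, 2: 190.0}   # watts per node per DVFS level
POWER_CAP = 2000.0                                  # cluster-wide cap in watts

# (node name, current DVFS level, utilization fraction)
nodes = [("n01", 0, 0.9), ("n02", 0, 0.2), ("n03", 0, 0.4),
         ("n04", 0, 0.1), ("n05", 0, 0.7), ("n06", 0, 0.3),
         ("n07", 0, 0.5), ("n08", 0, 0.6)]

def total_power(node_list):
    return sum(POWER_PER_LEVEL[level] for _, level, _ in node_list)

# Throttle least-loaded nodes first so busy nodes keep their performance.
nodes.sort(key=lambda n: n[2])
i = 0
while total_power(nodes) > POWER_CAP and i < len(nodes):
    name, level, util = nodes[i]
    if level + 1 in POWER_PER_LEVEL:
        nodes[i] = (name, level + 1, util)
    else:
        i += 1                       # node is already at its lowest level
print(f"estimated power after capping: {total_power(nodes):.0f} W")
```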
With communication boundedness, one node waits for another because the other node is slower or still performing some task. Several techniques have been developed to exploit this behavior:
Figure: CPU performance (Source: Liu & Zhu, 2010)
  • Communication intensive region analysis: When an MPI call is long enough (i.e. some parts of the task take more time than others), the CPU is run at a slower speed while it would otherwise be waiting. In this way energy is saved by running the CPU at a lower speed without compromising the result.
Rountree et al. showed another method in which each task is decomposed into smaller tasks that are executed on several parallel nodes. Because work is scheduled along the critical path (the longest path through the nodes), the program does not wait for data and completes the job; if a problem occurs, the work can be processed again. With this approach the program achieved 20% energy savings with only a 1% increase in execution time.
  • Job allocation approaches: The SLURM software is used for power management and job scheduling on the supercomputer. It also keeps track of idle nodes and informs the user, but it is entirely up to the user to put a node into a low power state. SLURM can ramp the number of active nodes up and down to prevent sudden rises in power demand, although the choice of job scheduling policy remains with the user.
Zong et al. give another solution based on the heterogeneous architecture of an HPC system. It takes into account that nodes have different speeds, power requirements, and execution times for a specific task, as well as different communication times depending on their location in the supercomputer. An allocation algorithm called Energy Efficient Task Duplication Scheduling (EETDS) assigns tasks in two phases, a grouping phase and an allocation phase: the grouping phase groups tasks according to the communication required to complete them, and the allocation phase estimates the time different nodes would take to complete different parts of the task. Tasks that take longer are therefore given to the faster nodes, so that the whole job finishes with minimum energy consumption. This algorithm improved energy efficiency by 47.1% with only a small degradation in performance (Beloglazov et al., 2011; Borah et al., 2015; Liu & Zhu, 2010; Mittal, 2014; Tang et al., 2006). A simplified grouping-and-allocation sketch follows.
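The sketch below is a greedy simplification of the grouping-and-allocation idea: tasks are grouped by how much they communicate with each other, and the largest groups are handed to the fastest nodes. It is not the published EETDS algorithm; the tasks, groups, and node speeds are made up for illustration.

```python
# Simplified sketch inspired by the heterogeneous allocation idea above:
# group tasks, then hand the longest-running groups to the fastest nodes.
# This is NOT the published EETDS algorithm, just a greedy illustration;
# task work, groups, and node speeds are made-up numbers.

# (task name, work units, communication group)
tasks = [("t1", 40, "A"), ("t2", 10, "A"), ("t3", 30, "B"),
         ("t4", 25, "B"), ("t5", 5, "C"), ("t6", 15, "C")]

# (node name, relative speed: work units processed per second)
node_speeds = {"fast": 10.0, "medium": 6.0, "slow": 3.0}

# Grouping phase: tasks that communicate heavily stay in one group.
groups = {}
for name, work, group in tasks:
    groups[group] = groups.get(group, 0) + work

# Allocation phase: the largest groups go to the fastest nodes.
ordered_groups = sorted(groups.items(), key=lambda g: g[1], reverse=True)
ordered_nodes = sorted(node_speeds.items(), key=lambda n: n[1], reverse=True)

for (group, work), (node, speed) in zip(ordered_groups, ordered_nodes):
    print(f"group {group} ({work} units) -> {node}, "
          f"runtime ~{work / speed:.1f} s")
```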
New supercomputers have already achieved a performance of 5.27 GFLOPS/W, and supercomputer developers are putting more focus on energy efficiency than on raw power.

Cooling mechanisms in supercomputers and data centers

Different types of cooling mechanisms are used in supercomputers and data centers. In supercomputers, liquid coolants are used to maintain an average temperature of around 65 degrees Celsius. Because of their small size and speeds in the TFLOPS range and above, the microprocessors of supercomputers consume huge amounts of power.
Chillers, air conditioning fans, and liquid coolants are used to cool supercomputers; in a typical data center, chillers supply cold air to the racks.
Today, most supercomputers use hot water to cool the microprocessors. There are two types of liquid cooling for servers:
  1. Indirect liquid cooling: For dense racks, rear-door water-cooled heat exchangers (RDHx) bring the liquid close to the racks, while controlled air still cools the datacom equipment itself. The advantage of this method is that the cooled air travels a shorter distance, which reduces energy loss.
  2. Total liquid cooling: In this method, liquid directly cools the datacom equipment. Non-conducting (dielectric) liquids are pumped past the equipment, rely on phase change, or use natural convection to carry heat away, and the dielectric liquid is in turn cooled with water or another liquid. Increasing focus has been placed on energy efficiency to decrease cooling expenses: power usage effectiveness (PUE) values in the range of 1.05 to 1.10 have been achieved for some data centers, although others still run at a PUE of about 1.50. In some data centers, racks are completely submerged in the dielectric liquid, which keeps the temperature constant and reduces cooling cost. However, liquid cooling of data centers is not yet widely used because of its high upfront cost, and companies often prefer to set up data centers in cold regions to avoid cooling expenses. The recommended average data center temperature is 27 degrees Celsius, but some data centers run at temperatures of up to 35 degrees Celsius (Ham, Kim, Choi, & Jeong, 2015; Iyengar et al., 2012; Pakbaznia & Pedram, 2009). A small PUE calculation sketch follows.
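The PUE figures quoted above follow from the definition PUE = total facility power / IT equipment power. The sketch below works through two assumed facilities, one near the 1.05-1.10 range and one near 1.50; the overhead numbers are illustrative, not measured values from the cited studies.

```python
# Hedged sketch of the power usage effectiveness (PUE) metric quoted above.
# PUE = total facility power / IT equipment power; numbers are assumptions.

def pue(it_power_kw, cooling_kw, other_overhead_kw):
    total_facility_kw = it_power_kw + cooling_kw + other_overhead_kw
    return total_facility_kw / it_power_kw

# A 500 kW IT load (as in Table 2) with a modest liquid-cooling overhead
# lands near the 1.05-1.10 range mentioned above ...
print(f"efficient facility PUE: {pue(500.0, 30.0, 10.0):.2f}")
# ... while an air-cooled facility with heavy chiller use does not.
print(f"legacy facility PUE:    {pue(500.0, 200.0, 50.0):.2f}")
```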












References

Allard, T., Anciaux, N., Bouganim, L., Guo, Y., Le Folgoc, L., Nguyen, B., … Yin, S. (2010). Secure personal data servers: A vision paper. Proceedings of the 36th VLDB (Very Large Data Bases), 3(1), 25–35.
Beloglazov, A., Buyya, R., Lee, Y. C., & Zomaya, A. (2011). A taxonomy and survey of energy-efficient data centers and cloud computing systems. Advances in Computers (Vol. 82). doi:10.1016/B978-0-12-385512-1.00003-7
Borah, A. D., Muchahary, D., & Singh, S. K. (2015). Power saving strategies in green cloud computing systems, 8(1), 299–306. doi:10.14257/ijgdc.2015.8.1.28
Duchessi, P., & Chengalur-Smith, I. (1998). Client/server benefits, problems, best practices. Communications of the ACM, 41(5), 87–94. doi:10.1145/274946.274961
Ham, S. W., Kim, M. H., Choi, B. N., & Jeong, J. W. (2015). Simplified server model to simulate data center cooling energy consumption. Energy and Buildings, 86, 328–339. doi:10.1016/j.enbuild.2014.10.058
Iyengar, M., David, M., Parida, P., Kamath, V., Kochuparambil, B., Graybill, D., … Chainer, T. (2012). Server liquid cooling with chiller-less data center design to enable significant energy savings. In Annual IEEE Semiconductor Thermal Measurement and Management Symposium (pp. 212–223). doi:10.1109/STHERM.2012.6188851
Koomey, J. G. (2007). Estimating total power consumption by servers in the US and the world. Final report, February 15.
Koomey, J. G. (2008). Worldwide electricity used in data centers. Environmental Research Letters, 3(3), 034008. doi:10.1088/1748-9326/3/3/034008
Liu, Y., & Zhu, H. (2010). A survey of the research on power management techniques for high-performance systems, 943–964. doi:10.1002/spe
Manasse, M. S., McGeoch, L. A., & Sleator, D. D. (1990). Competitive algorithms for server problems. Journal of Algorithms, 11(2), 208–230. doi:10.1016/0196-6774(90)90003-W
Mittal, S. (2014). Power management techniques for data centers: A survey. arXiv preprint arXiv:1404.6681. doi:10.2172/1150909
Pakbaznia, E., & Pedram, M. (2009). Minimizing data center cooling and server power costs. In Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '09), 145. doi:10.1145/1594233.1594268
Speitkamp, B., & Bichler, M. (2010). A mathematical programming approach for server consolidation problems in virtualized data centers. IEEE Transactions on Services Computing, 3(4), 266–278. doi:10.1109/TSC.2010.25
Tang, Q., Mukherjee, T., Gupta, S. K. S., & Cayton, P. (2006). Sensor-based fast thermal evaluation model for energy efficient high-performance datacenters. Proceedings of the 4th International Conference on Intelligent Sensing and Information Processing (ICISIP 2006), 203–208. doi:10.1109/ICISIP.2006.4286097
