Online Safety Community

Evaluating Performance of Distributed Systems with MapReduce

MapReduce in Geographically Distributed Environments

The performance of MapReduce across geographically distributed environments depends heavily on how much data must cross the network and on the bandwidth and latency of the links between sites. Different MapReduce configurations suit different data distribution models. To select the appropriate model, it is important to understand the workload characteristics of the MapReduce job you are trying to run.

Cardosa et al. describe three workload data aggregation schemes for MapReduce jobs (Michael Cardosa, 2011). The first, “High Aggregation”, occurs when the output of the MapReduce process is orders of magnitude smaller than the input. These are jobs in which input data is categorized and counted, so that large numbers of matches are reduced to simple category counts. Examples include MapReduce grep and jobs that count words or HREFs across large amounts of distributed data: the input is a large list of files, and the output is a much smaller list of counts. The second scheme, “Net Zero Aggregation”, occurs when the output is approximately the same size as the input. Sort is a good example: the sorted output is typically the same size as the input. The final scheme, “Ballooning Data”, occurs when the MapReduce job produces more records and data than it was given. An example would be a MapReduce job that converts compact formats such as GIF to larger data formats such as JPEG.

The amount of data produced is an important factor to consider when architecting a MapReduce solution. In their study, Cardosa et al. found that when workloads are highly aggregated, a geographically distributed environment works well. For zero aggregation or ballooning data, centralizing the data before applying MapReduce is preferred.
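The three schemes and the placement finding above can be sketched as a small decision helper. This is only an illustration: the function names and the `tolerance` and `high_ratio` thresholds are assumptions made for this example, not values taken from the study.

```python
from enum import Enum


class Aggregation(Enum):
    HIGH = "high aggregation"
    ZERO = "net zero aggregation"
    BALLOONING = "ballooning data"


def classify_aggregation(input_bytes, output_bytes, tolerance=0.10, high_ratio=0.01):
    """Classify a job by its output/input size ratio.

    tolerance and high_ratio are illustrative assumptions, not figures
    from the study. Returns None when the ratio falls between the
    three named schemes.
    """
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    ratio = output_bytes / input_bytes
    if ratio > 1 + tolerance:
        return Aggregation.BALLOONING
    if ratio >= 1 - tolerance:
        return Aggregation.ZERO
    if ratio <= high_ratio:
        return Aggregation.HIGH
    return None


def recommend_placement(scheme):
    """Map a scheme to the placement the study found to work well."""
    if scheme is Aggregation.HIGH:
        return "run MapReduce at each site, then combine the small outputs"
    if scheme in (Aggregation.ZERO, Aggregation.BALLOONING):
        return "centralize the data first, then run MapReduce"
    return "no recommendation for this ratio"


if __name__ == "__main__":
    scheme = classify_aggregation(input_bytes=10**9, output_bytes=10**5)
    print(scheme.value, "->", recommend_placement(scheme))
```

In practice the output size is rarely known in advance, so a helper like this would be fed an estimate from a sample run of the job.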

Text Mining Applications with MapReduce

MapReduce is gaining attention from the scientific community in the area of natural language processing (Atilla Soner Balkir, 2011). Natural language processing models often involve optimization algorithms across large amounts of data. A long-standing constraint in natural language processing has been high-speed access to large numbers of frequently changing parameter values. In information retrieval, the number of times an index term occurs in a document is called its term frequency. Discovering recurrent phrases in text automatically and with quick turnaround is key to natural language applications, because a key phrase can help identify the intent of the user. Balkir et al. used MapReduce, implemented with Hadoop, to develop a model that chunks sentences into smaller phrases to help identify recurrent phrases. Their approach achieved a sixfold speedup in identifying and matching these phrases against large, distributed data banks. The use of MapReduce is likely to continue to drive advances in natural language processing, speech recognition, and other forms of artificial intelligence.
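To make the idea of finding recurrent phrases concrete, here is a minimal single-process sketch of phrase counting written in a map and reduce shape. It is not the model developed by Balkir et al. and it does not use Hadoop; it simply counts fixed-length word sequences, and the function names and the `n` and `min_count` parameters are assumptions for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phrases(sentence, n=2):
    """Map step: emit (phrase, 1) for every n-word phrase in one sentence."""
    words = sentence.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the counts emitted for each phrase."""
    totals = defaultdict(int)
    for phrase, count in pairs:
        totals[phrase] += count
    return dict(totals)


def recurrent_phrases(sentences, n=2, min_count=2):
    """Return the phrases that occur at least min_count times."""
    pairs = chain.from_iterable(map_phrases(s, n) for s in sentences)
    counts = reduce_counts(pairs)
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    corpus = [
        "book a table for two",
        "please book a table",
        "a table for dinner",
    ]
    print(recurrent_phrases(corpus))
```

Because each sentence is mapped independently and the reduce step only sums counts, the same structure could be spread across many nodes, which is what makes the problem a natural fit for MapReduce.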

Software Quality Assurance

Another application of MapReduce is the analysis of software repositories. Shang et al. studied the use of the MapReduce model for mining software repositories (Weiyi Shang, 2010). The field of Mining Software Repositories involves analyzing source code, deployment logs, and bug repositories to find statistical correlations that can be used to identify and address issues in the code, such as potential security flaws. To illustrate their approach, they created a MapReduce program to count the number of source code lines. The program had to evaluate each line to determine whether it was an actual instruction or a comment. On extremely large repositories, this type of program showed a 20-fold improvement in performance over traditional approaches. They conclude that automated software engineering tools play an important role in the analysis of software repositories.
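The line-counting job described above can be sketched in a few lines. This is an illustrative, single-process version rather than the program from the study: the function names are hypothetical, and it assumes comments are whole lines beginning with `#`, so a line that mixes code with a trailing comment is counted as code.

```python
from collections import Counter


def map_line(line):
    """Map step: classify one source line and emit (category, 1)."""
    stripped = line.strip()
    if not stripped:
        return "blank", 1
    if stripped.startswith("#"):  # assumption: '#' starts a comment line
        return "comment", 1
    return "code", 1


def count_lines(files):
    """Reduce step: sum the per-line categories across all files.

    files maps a file name to that file's source text.
    """
    totals = Counter()
    for text in files.values():
        for line in text.splitlines():
            category, count = map_line(line)
            totals[category] += count
    return dict(totals)


if __name__ == "__main__":
    repo = {
        "a.py": "# header\nimport os\n\nprint(os.getcwd())\n",
        "b.py": "x = 1  # inline\n# done\n",
    }
    print(count_lines(repo))
```

Each file can be classified without reference to any other, so on a real cluster the map step would run in parallel over the repository and only the small per-category totals would need to be combined.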

Personalized PageRank

One of the best-known graph computation problems is computing personalized PageRank (Haveliwala, 2002). Personalized PageRank algorithms are used by search providers such as Microsoft, Yahoo and Google to return the most relevant search results for a given user based on their observed preferences and tendencies. They are also used in link prediction and recommendation engines. The two approaches for computing personalized PageRank are linear algebraic techniques and Monte Carlo simulations. MapReduce suits this problem because it provides a computational abstraction while taking care of lower-level issues such as job distribution, data storage and flow, and fault tolerance. In their research, Haveliwala et al. showed how MapReduce combined with advanced statistical modeling techniques could give search users a personalized experience, taking their intent into account to return relevant results. For example, a user who enters the phrase “Recommended Restaurants” will want results that are relevant to their general locale.
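As a rough illustration of the linear algebraic approach, the sketch below computes personalized PageRank by power iteration, with each iteration written as a map step (a node shares its rank among its out-links) and a reduce step (each node sums what it receives). It runs in a single process and is not the method from the cited research. The function names, the damping factor of 0.85, and the iteration count are assumptions, and the sketch requires every node to have at least one out-link.

```python
from collections import defaultdict


def map_contributions(rank, out_links):
    """Map step: a node splits its current rank evenly among its out-links."""
    share = rank / len(out_links)
    for target in out_links:
        yield target, share


def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power iteration for personalized PageRank.

    graph maps each node to a non-empty list of out-links, and every
    link target must also be a key (dangling nodes are not handled).
    All teleport probability returns to `source`, which is what
    personalizes the ranking.
    """
    ranks = {node: 0.0 for node in graph}
    ranks[source] = 1.0
    for _ in range(iterations):
        # Reduce step: sum the contributions arriving at each node.
        incoming = defaultdict(float)
        for node, out_links in graph.items():
            for target, share in map_contributions(ranks[node], out_links):
                incoming[target] += share
        ranks = {
            node: damping * incoming[node]
            + (1.0 - damping) * (1.0 if node == source else 0.0)
            for node in graph
        }
    return ranks


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for node, score in sorted(personalized_pagerank(web, "a").items()):
        print(node, round(score, 3))
```

Pages that are close to the source node in the link graph score higher, while a page that cannot be reached from the source (node `d` here) scores zero, so the same graph yields different rankings for different users.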

Summary

As the cost of data farms and utility compute environments decreases, the amount of data that organizations collect and analyze for insights will continue to rise exponentially. MapReduce is a programming pattern that takes advantage of large utility compute farms with many distributed nodes, and its simplified programming model gives developers an easy way to process large amounts of distributed data. While many proprietary MapReduce implementations exist, Apache Hadoop provides an open source implementation that supports MapReduce, a distributed file system, and a variety of other functions.

We’ve shown how MapReduce functions, its pros and cons, its performance considerations, and some popular implementations and supporting systems, such as GFS and Hadoop. We’ve also shown how organizations are using MapReduce for problems such as personalized PageRank, software quality assurance, and text mining. As more data is collected, MapReduce will continue to provide a simplified way of analyzing these large data sets.



