
CHAPTER 1

Introduction


1.1 Overview

We live in an information age, surrounded by a tremendous quantity of information and data, and it is often unclear how this seemingly unbounded volume of data should be organized. With access to vast volumes of data, decision makers frequently draw conclusions from data repositories that may contain data quality problems, for a variety of reasons. Data quality is therefore a serious concern in decision-making. Data quality issues arise from the nature of the information supply chain [1]: the consumer of a data product may be several supply-chain steps removed from the people or groups who gathered the original datasets on which the data product is based.

Figure 1.1: The method of Knowledge Discovery in Databases

These consumers use data products to make decisions, often with financial and time-budgeting implications. The separation of the data consumer from the data producer creates a situation where the consumer has little or no idea about the level of quality of the data [2], leading to the potential for poor decision-making and poorly allocated time and financial resources.

 

 

Figure 1.2: Data Mining Definition

 

1.2 Motivation

Association rule mining is an elementary topic in data mining. It discovers frequent patterns, associations, correlations, or other fundamental structures among sets of items or objects in transaction databases, relational databases, and other information repositories. The amount of data is increasing significantly as data is generated by day-to-day activities. Therefore, mining association rules from the huge amount of data in a database is of concern to many industries, because it can support many business decision-making processes, such as cross-marketing, basket data analysis, and promotion assortment. Techniques for discovering association rules from data have conventionally focused on identifying relationships between items that describe some feature of human behavior, usually trade behavior, in order to determine the items that customers buy together. Every rule of this type describes a particular local pattern, and a group of association rules can be easily interpreted and communicated.

 

It is fundamentally important to state that the key to understanding data mining technology is the ability to distinguish between data mining operations, applications and techniques [2], as shown in Fig. 1.2.

 

A lot of studies have been done in the area of association rule mining since it was first introduced. Many of them address various conceptual, implementation, and application issues relating to the association rule mining task.

 

1.3 Association Rule Mining

As noted above, association rules describe local patterns that relate items occurring together in transactions, and a group of such rules can be easily interpreted and communicated. Formally, an association rule is an implication of the form X ⇒ Y, where X and Y are disjoint sets of items drawn from the transactions of a database D.

 

The association rule X ⇒ Y has support s in D if s is the probability that a transaction in D contains both X and Y; it has confidence c in D if c is the fraction of the transactions containing X that also contain Y.

 

The task of mining association rules is to find all the association rules whose support is larger than a minimum support threshold and whose confidence is larger than a minimum confidence threshold [1]. These rules are called strong association rules.
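As a concrete illustration, the following Java snippet (a minimal sketch with hypothetical item names, not a dataset used in this thesis) computes the support and confidence of one candidate rule over a small in-memory transaction database:

```java
import java.util.*;

public class RuleMetrics {
    public static void main(String[] args) {
        // A toy transaction database D: each transaction is a set of items.
        List<Set<String>> D = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "butter", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")),
            new HashSet<>(Arrays.asList("bread", "jam")),
            new HashSet<>(Arrays.asList("milk", "jam"))
        );

        Set<String> X = Collections.singleton("bread");   // antecedent
        Set<String> Y = Collections.singleton("butter");  // consequent

        long countX = D.stream().filter(t -> t.containsAll(X)).count();
        long countXY = D.stream()
                        .filter(t -> t.containsAll(X) && t.containsAll(Y))
                        .count();

        double support = (double) countXY / D.size();   // P(X and Y)
        double confidence = (double) countXY / countX;  // P(Y | X)

        System.out.printf("support(X => Y) = %.2f%n", support);
        System.out.printf("confidence(X => Y) = %.2f%n", confidence);
    }
}
```

Here {bread} ⇒ {butter} holds in 2 of the 4 transactions, so its support is 0.50; bread appears in 3 transactions, so the confidence is about 0.67.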

1.4 HADOOP

Hadoop is an open-source framework from Apache that is used to store, process, and analyze data that is very huge in volume. Hadoop runs applications using the MapReduce paradigm, in which the data is processed in parallel across nodes. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.

Hadoop Architecture

At its core, Hadoop has two major layers:

·         Processing/computation layer (MapReduce)

·         Storage layer (Hadoop Distributed File System, HDFS)

Figure 1.4: Hadoop Architecture

1.4.1 MapReduce

To take advantage of Hadoop's parallel processing, the query must be expressed in MapReduce form. MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is over. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
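As an illustration of this key-value flow, the following is a minimal sketch based on the standard Hadoop Word-Count example (the class names WordCountMapper and WordCountReducer are chosen here for illustration; the org.apache.hadoop MapReduce API is assumed):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper phase: for every input line, emit (word, 1) key-value pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key = word, value = 1
            }
        }
    }
}

// Reducer phase: for each unique word, sum the list of counts.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // (word, total count)
    }
}
```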

Figure 1.5: MapReduce flow diagram

1.4.2 Steps in MapReduce

1. Map takes data in the form of key-value pairs and returns a list of key-value pairs. The keys will not be unique at this stage.

2. Using the output of Map, sort and shuffle are applied by the Hadoop framework. Sort and shuffle act on these lists of pairs and send out each unique key together with the list of values associated with that unique key.

3. The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output is stored or displayed.
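A minimal driver that wires these steps together might look as follows (a sketch assuming the WordCountMapper and WordCountReducer classes shown above; input and output HDFS paths are taken from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setReducerClass(WordCountReducer.class);   // reduce phase (after sort/shuffle)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

        // Sort and shuffle between map and reduce are handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```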

Figure 1.6: MapReduce steps

 

1.5 Problem Domain

 

Frequent itemset mining is well recognized in the data mining field because of its wide application in association rule mining, correlation analysis, graph pattern mining constrained by frequent patterns, sequential pattern mining, and many other data mining tasks. Well-organized algorithms for mining frequent itemsets are necessary for mining association rules as well as for many other data mining tasks.

The most important challenge in frequent pattern mining is the large number of result patterns. As the minimum support threshold becomes lower, an exponentially huge number of itemsets is generated. Pruning unimportant patterns efficiently during the mining process has therefore become one of the most important topics in frequent pattern mining. The main aim is thus to optimize the process of finding patterns, which should be efficient, scalable, and able to identify the important patterns that can be used in different ways.
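One classical form of such pruning is the Apriori prune step, which relies on the downward-closure property: every subset of a frequent itemset must itself be frequent. The following is a minimal sketch of that step (the item names and the frequent 2-itemsets are hypothetical):

```java
import java.util.*;

// Sketch of the classical Apriori prune step: a candidate k-itemset is
// discarded if any of its (k-1)-subsets is not among the frequent
// (k-1)-itemsets found in the previous pass.
public class AprioriPrune {
    public static boolean survivesPruning(Set<String> candidate,
                                          Set<Set<String>> frequentPrev) {
        for (String item : candidate) {
            Set<String> subset = new HashSet<>(candidate);
            subset.remove(item);                 // one (k-1)-subset
            if (!frequentPrev.contains(subset)) {
                return false;                    // an infrequent subset: prune
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical frequent 2-itemsets from the previous pass.
        Set<Set<String>> frequent2 = new HashSet<>(Arrays.asList(
                new HashSet<>(Arrays.asList("bread", "butter")),
                new HashSet<>(Arrays.asList("bread", "milk")),
                new HashSet<>(Arrays.asList("butter", "milk"))));

        Set<String> candidate3 = new HashSet<>(Arrays.asList("bread", "butter", "milk"));
        System.out.println(survivesPruning(candidate3, frequent2));  // prints true
    }
}
```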

 

1.6 Aim & Objectives

The main objective of this research work is to improve the classical version of the Apriori algorithm, based on a top-down approach, by using association rules with Hadoop MapReduce programming, building on hands-on experience in developing a Hadoop-based Word-Count application; the Hadoop MapReduce Word-Count example is a standard example. In the improved algorithm, the rules avoid the generation of unnecessary patterns. This improved Apriori algorithm can be used in various types of mining.

The problems or limitations defined in the above sections of this chapter are proposed to be solved by:

 

1.  Installation of Hadoop on a Linux environment for a single node.

2.  Implementation of MapReduce with the Word-Count problem.

3.  To study and evaluate various accessible algorithms for mining frequent itemsets on various datasets.

4.  To propose a new idea for mining frequent itemsets from trader transactional databases, i.e. for the above problem (a sketch of the underlying counting step follows this list).

5.  To validate the new scheme on datasets.
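To illustrate how such a scheme can be distributed, the sketch below expresses one Apriori counting pass as a MapReduce job: the mapper emits (candidate itemset, 1) for every candidate contained in a transaction, and the reducer sums the counts and keeps only itemsets that reach a minimum support count. This is only a hedged sketch of the general counting pattern, not the improved top-down algorithm proposed later in this thesis; the candidate itemsets, item names, and threshold are hypothetical.

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for one Apriori pass: each input line is a transaction with items
// separated by spaces. For every candidate itemset the transaction contains,
// emit (itemset, 1). In practice the candidates would come from the previous
// pass; here they are hard-coded for illustration only.
public class CandidateCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final List<Set<String>> candidates = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "butter")),   // hypothetical candidates
            new HashSet<>(Arrays.asList("bread", "milk")));

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Set<String> transaction =
                new HashSet<>(Arrays.asList(value.toString().split("\\s+")));
        for (Set<String> c : candidates) {
            if (transaction.containsAll(c)) {
                // Use a sorted, comma-joined form of the itemset as the key.
                List<String> sorted = new ArrayList<>(c);
                Collections.sort(sorted);
                context.write(new Text(String.join(",", sorted)), ONE);
            }
        }
    }
}

// Reducer: sum the counts and keep only itemsets whose support count
// reaches the (assumed) minimum support threshold.
class FrequentItemsetReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int MIN_SUPPORT_COUNT = 2;  // assumed threshold

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (sum >= MIN_SUPPORT_COUNT) {
            context.write(key, new IntWritable(sum));  // frequent itemset and its count
        }
    }
}
```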

 

1.7 Thesis Outline

The thesis is organized in the following way:

 

Chapter-1 Introduction: This chapter covers all the introductory material required for understanding the domain area. It gives the details that are necessary to understand the work and to measure its outcomes. It provides the motivation, background, an understanding of the problems, and a view of the proposed solution. This is the first and essential part of the report, containing brief details about association rules with Hadoop.

 

Chapter-2 Literature Review: This chapter presents a survey of the technologies available in the domain. A wide variety of existing mechanisms, algorithms, and architectures is studied to identify which issues have been resolved and which remain open in the area of association rules with Hadoop.

 

Chapter-3 Problem Identification: In this chapter we identify the problems in the existing system. It then gives a brief categorization of the various approaches that have been suggested over the last few years for association rules with Hadoop using data mining techniques.

 

Chapter-4 Proposed Work: After studying the different existing mechanisms, this chapter identifies the system preliminaries. It gives a clear understanding of the algorithm and its steps, helping the solution provide a better resolution of the current problems. It also gives the implementation plan and the testing strategy for the above problems by suggesting an architectural solution. The implementation of the proposed system is carried out in this chapter: the platform on which the implementation runs and the kind of approach that is followed are described in this section.

 

Chapter-5 Result Analysis: Developing a solution demonstrates an approach, but proving its results is a complicated task, because every step of the solution must be measured and compared with the existing mechanisms. Whether or not the proposed system we have implemented works properly is discussed in this section. The results are verified on the basis of this analysis.

Chapter-6 Conclusion and Future Work: This chapter gives concluding remarks on the dissertation and provides a final analysis and comparison, along with some future directions for the work. The future scope and a short summary are discussed, giving an idea of how the work performed in this report can be expanded in the future.

 

 

 

1.8 Summary

This chapter deals with all the introductory requirements for understanding the domain area. It gives the details that are necessary to understand the work and to measure its outcomes. It provides the motivation, background, an understanding of the problems, and a view of the proposed solution.