Link Analysis

In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes. Relationships may be identified among various types of nodes (objects), including organizations, people and transactions. Link analysis has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research and medical research.

Knowledge discovery

Knowledge discovery is an iterative and interactive process used to identify, analyze and visualize patterns in data. Network analysis, link analysis and social network analysis are all methods of knowledge discovery, each a subset of the preceding method. Most knowledge discovery methods follow these steps (at the highest level), as the sketch after the list illustrates:
  1. Data processing
  2. Transformation
  3. Analysis
  4. Visualization


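As a concrete illustration of these four steps, here is a minimal Python sketch of a toy link-analysis pipeline. The file name and column names ("contacts.csv", "caller", "receiver") are hypothetical; real tools operate on far larger and messier data.

    import csv
    from collections import Counter

    # 1. Data processing: read raw records from a hypothetical CSV of
    #    phone contacts and drop malformed rows.
    def process(path):
        with open(path, newline="") as f:
            return [row for row in csv.DictReader(f)
                    if row.get("caller") and row.get("receiver")]

    # 2. Transformation: convert records into a node-pair (edge) list.
    def transform(records):
        return [(r["caller"], r["receiver"]) for r in records]

    # 3. Analysis: count how often each pair of nodes is linked.
    def analyze(edges):
        return Counter(edges)

    # 4. Visualization: a crude text rendering of the strongest links.
    def visualize(link_counts, top=5):
        for (a, b), n in link_counts.most_common(top):
            print(f"{a} --{n}--> {b}")

    # visualize(analyze(transform(process("contacts.csv"))))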

Data gathering and processing requires access to data and has several inherent issues, including information overload and data errors. Once data is collected, it will need to be transformed into a format that can be effectively used by both human and computer analyzers. Manual or computer-generated visualization tools may be mapped from the data, including network charts. Several algorithms exist to help with analysis of data – Dijkstra’s algorithm, breadth-first search, and depth-first search.
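For example, breadth-first search answers a basic link-analysis question: what is the smallest number of links separating two nodes? A minimal sketch, with an invented contact network:

    from collections import deque

    def bfs_hops(graph, start, goal):
        # Return the minimum number of links between start and goal, or None.
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if node == goal:
                return hops
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
        return None

    # Hypothetical contact network (directed: who contacted whom).
    network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
    print(bfs_hops(network, "A", "E"))  # -> 3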

Link analysis focuses on analysis of relationships among nodes through visualization methods (network charts, association matrix). Here is an example of the relationships that may be mapped for crime investigations:
Relationship type and typical data sources:
  1. Trust: prior contacts in family, neighborhood, school, military, club or organization; public and court records (data may only be available in the suspect's native country).
  2. Task: logs and records of phone calls, electronic mail, chat rooms, instant messages and Web site visits; travel records; human intelligence, such as observation of meetings and attendance at common events.
  3. Money and resources: bank account and money transfer records; pattern and location of credit card use; prior court records; human intelligence, such as observation of visits to alternate banking resources such as Hawala.
  4. Strategy and goals: Web sites; videos and encrypted disks delivered by courier; travel records; human intelligence, such as observation of meetings and attendance at common events.
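Underlying both network charts and association matrices is the same structure: entities as rows and columns, with an entry marking each observed relationship. A minimal sketch of an association matrix, with invented names and links:

    # Build a simple association (adjacency) matrix from observed links.
    people = ["A", "B", "C", "D"]                 # hypothetical entities
    links = [("A", "B"), ("B", "C"), ("A", "D")]  # hypothetical observed ties

    index = {p: i for i, p in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for a, b in links:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1  # symmetric

    for person, row in zip(people, matrix):
        print(person, row)   # e.g. A [0, 1, 0, 1]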


Link analysis is used for three primary purposes (the first is sketched below):
  1. Find matches in data for known patterns of interest;
  2. Find anomalies where known patterns are violated;
  3. Discover new patterns of interest (social network analysis, data mining).
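A toy sketch of the first purpose: matching a known pattern of interest, in this case a "triangle" of three mutually linked nodes, against network data. The links are invented; production tools match far richer patterns.

    from itertools import combinations

    def find_triangles(edges):
        # Return all sets of three mutually linked nodes (a simple known pattern).
        links = {frozenset(e) for e in edges}
        nodes = {n for e in edges for n in e}
        return [trio for trio in combinations(sorted(nodes), 3)
                if all(frozenset(pair) in links for pair in combinations(trio, 2))]

    # Hypothetical observed links between entities.
    edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
    print(find_triangles(edges))  # -> [('A', 'B', 'C')]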

History

Klerks categorized link analysis tools into three generations. The first generation was introduced in 1975 as the Anacapa Chart of Harper and Harris. This method requires that a domain expert review data files, identify associations by constructing an association matrix, create a link chart for visualization and finally analyze the network chart to identify patterns of interest. The approach demands extensive domain knowledge and is extremely time-consuming when reviewing vast amounts of data.

Second generation tools consist of automatic graphics-based analysis tools such as Analyst’s Notebook, Netmap and Watson. These tools can automate the construction and updating of the link chart once an association matrix is manually created; however, analysis of the resulting charts and graphs still requires an expert with extensive domain knowledge.

Third generation tools have not yet been produced; this generation is expected to move beyond drawing of links to interpreting links.

Applications


Information overload

With the vast amounts of data and information that are stored electronically, users are confronted with multiple unrelated sources of information available for analysis. Data analysis techniques are required to make effective and efficient use of the data. Palshikar classifies data analysis techniques into two categories – statistical (models, time-series analysis, clustering and classification, matching algorithms to detect anomalies) and artificial intelligence (AI) techniques (data mining, expert systems, pattern recognition, machine learning, neural networks).

Bolton & Hand divide statistical data analysis into supervised and unsupervised methods. Supervised learning methods require that rules be defined within the system to establish what is expected or unexpected behavior. Unsupervised learning methods compare data against the norm and detect statistical outliers. Supervised learning methods are limited in the scenarios they can handle, as they require that training rules be established from previous patterns. Unsupervised learning methods can detect broader issues but may result in a higher false-positive ratio if the behavioral norm is not well established or understood.
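A minimal sketch of the unsupervised approach: flag values that deviate sharply from the norm of the data itself, with no predefined rules. The transaction amounts and the two-standard-deviation threshold are illustrative assumptions.

    from statistics import mean, stdev

    def outliers(values, threshold=2.0):
        # Flag values more than threshold standard deviations from the mean.
        mu, sigma = mean(values), stdev(values)
        return [v for v in values if abs(v - mu) > threshold * sigma]

    # Hypothetical transaction amounts; one is far outside the norm.
    amounts = [120, 95, 130, 110, 105, 98, 9500, 115, 102]
    print(outliers(amounts))  # -> [9500]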

Data itself has inherent issues, including integrity (or the lack thereof) and continuous change. Data may contain “errors of omission and commission because of faulty collection or handling, and when entities are actively attempting to deceive and/or conceal their actions”. Sparrow highlights incompleteness (inevitability of missing data or links), fuzzy boundaries (subjectivity in deciding what to include) and dynamic changes (recognition that data is ever-changing) as the three primary problems with data analysis.

Once data is transformed into a usable format, open texture and cross-referencing issues may arise. Open texture was defined by Waismann as the unavoidable uncertainty in meaning when empirical terms are used in different contexts. Uncertainty in the meaning of terms presents problems when attempting to search and cross-reference data from multiple sources.

The primary method for resolving data analysis issues is reliance on domain knowledge from an expert. This is a very time-consuming and costly way of conducting link analysis and has inherent problems of its own. McGrath et al. conclude that the layout and presentation of a network diagram have a significant impact on the user’s “perceptions of the existence of groups in networks”. Even using domain experts may result in differing conclusions, as analysis may be subjective.

Prosecution vs. crime prevention

Link analysis techniques have primarily been used for prosecution, as it is far easier to review historical data for patterns than it is to attempt to predict future actions.

Krebs demonstrated the use of an association matrix and link chart of the terrorist network associated with the 19 hijackers responsible for the September 11 attacks by mapping details made publicly available after the attacks. Even with the advantages of hindsight and public information on people, places and transactions, it is clear that there is missing data.

Alternatively, Picarelli argued that link analysis techniques could have been used to identify and potentially prevent illicit activities within the Aum Shinrikyo network. “We must be careful of ‘guilt by association’. Being linked to a terrorist does not prove guilt – but it does invite investigation.” Balancing the legal concepts of probable cause, right to privacy and freedom of association becomes challenging when reviewing potentially sensitive data with the objective of preventing crime or illegal activity that has not yet occurred.

Proposed solutions

There are four categories of proposed link analysis solutions:
  1. Heuristic-based
  2. Template-based
  3. Similarity-based
  4. Statistical


Heuristic-based tools utilize decision rules that are distilled from expert knowledge using structured data. Template-based tools employ Natural Language Processing (NLP) to extract details from unstructured data that are matched to pre-defined templates. Similarity-based approaches use weighted scoring to compare attributes and identify potential links. Statistical approaches identify potential links based on lexical statistics.
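A minimal sketch of the similarity-based idea: score a candidate pair of records by a weighted comparison of their attributes and propose a link when the score clears a threshold. The attributes, weights and threshold are illustrative assumptions, not values from the literature.

    # Illustrative attribute weights for comparing two records.
    WEIGHTS = {"name": 0.5, "phone": 0.3, "address": 0.2}

    def similarity(a, b):
        # Weighted sum over attributes on which the two records agree.
        return sum(w for attr, w in WEIGHTS.items() if a.get(attr) == b.get(attr))

    rec1 = {"name": "J. Smith", "phone": "555-0101", "address": "12 Oak St"}
    rec2 = {"name": "J. Smith", "phone": "555-0101", "address": "98 Elm Rd"}

    score = similarity(rec1, rec2)   # 0.5 + 0.3 = 0.8
    if score >= 0.7:                 # hypothetical linking threshold
        print(f"potential link (score {score:.1f})")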

CrimeNet Explorer

J.J. Xu and H. Chen propose a framework for automated network analysis and visualization called CrimeNet Explorer. This framework includes the following elements:
  • Network Creation through a concept space approach that uses “co-occurrence weight to measure the frequency with which two words or phrases appear in the same document. The more frequently two words or phrases appear together, the more likely it will be that they are related” (see the sketch after this list).
  • Network Partition using “hierarchical clustering to partition a network into subgroups based on relational strength”.
  • Structural Analysis through “three centrality measures (degree, betweenness, and closeness) to identify central members in a given subgroup”. CrimeNet Explorer employed Dijkstra’s shortest-path algorithm to calculate the betweenness and closeness from a single node to all other nodes in the subgroup.
  • Network Visualization using Torgerson’s metric multidimensional scaling (MDS) algorithm.
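The co-occurrence idea behind network creation can be sketched in a few lines: count how often two names appear in the same document, treating higher counts as stronger evidence of a relationship. The names and documents are invented; the actual framework normalizes such counts into weights.

    from collections import Counter
    from itertools import combinations

    def cooccurrence(documents):
        # Count, for each pair of names, the documents in which both appear.
        counts = Counter()
        for names in documents:
            for pair in combinations(sorted(set(names)), 2):
                counts[pair] += 1
        return counts

    # Hypothetical names extracted from three incident reports.
    docs = [{"Smith", "Jones"}, {"Smith", "Jones", "Lee"}, {"Lee", "Jones"}]
    print(cooccurrence(docs))
    # Counter({('Jones', 'Smith'): 2, ('Jones', 'Lee'): 2, ('Lee', 'Smith'): 1})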
