The average accuracy for the six cases of order 0 is Removing the punctuation affects the classification accuracy.
The average accuracy with punctuation is Vocabulary pruning and stemming boost the performance and the best result is The average accuracy for different orders is quite similar while order 2 improves the accuracy by 0.
Removing the punctuation degrades the performance by 0. Vocabulary pruning and stemming help to strengthen the result and the best result is For scam-ham data set, all the experiments achieve very good accuracies and the worst accuracy is Removing punctuation degrades the result from The best result is Clearly, this is an imbalanced performance.
This may be due to an insufficient amount of training data to support higher-order models. Also, when collecting the data, all emails from the student selected to be the deceiver were labeled as deceptive, and emails from the other one were labeled as truthful. However, the students acting as deceivers may not have deceived in every email in reality.
This could have corrupted the DSP data set. For the phishing-ham data set, the detection rate varies within a small range. For order 2, the results for all six cases are quite close, indicating that the preprocessing procedure plays only a minor role when using a higher model order.
For the scam-ham data set, the NOP procedure results in a lower false positive rate while a lower detection rate is also achieved compared to other preprocessing procedures. From these results, Applicants conclude that word-based PPMC models with an order less than 2 are suitable to detect deception in texts and punctuation indeed plays a role in detection.
In addition, applying vocabulary pruning and stemming can further improve the results on DSP and phishing-ham data sets. Stemming and vocabulary pruning mitigate the sparsity and boost the performance.
For the scam-ham data set, the size is relatively large, and therefore stemming and vocabulary pruning do not influence the performance.
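The compression-based classification idea can be illustrated with a minimal sketch. The code below is not the Applicants' PPMC implementation; it is a hedged order-0 word-model analogue that uses add-one smoothing in place of PPM's escape mechanism, and it assigns a document to the class whose model encodes it in fewer bits:

```python
import math
from collections import Counter

def train_word_model(docs):
    """Order-0 word model: word counts over a class's training corpus."""
    counts = Counter()
    for d in docs:
        counts.update(d.lower().split())
    return counts

def description_length(model, doc, vocab_size):
    """Bits needed to encode doc under the model, with add-one smoothing
    standing in for PPM's escape mechanism."""
    total = sum(model.values())
    bits = 0.0
    for w in doc.lower().split():
        p = (model[w] + 1) / (total + vocab_size)
        bits += -math.log2(p)
    return bits

def classify(doc, model_a, model_b, vocab_size):
    """Assign doc to the class whose model compresses it into fewer bits."""
    dl_a = description_length(model_a, doc, vocab_size)
    dl_b = description_length(model_b, doc, vocab_size)
    return "A" if dl_a <= dl_b else "B"
```

A real PPMC model would additionally condition on preceding words up to the chosen order and back off through escape probabilities; the minimum-description-length decision rule, however, is the same.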
From the table, Applicants observe that, at the character level, order 0 is not effective for classifying the texts in any of the three data sets. Punctuation also plays a role in classification: removing the punctuation degrades the performance in most of the cases.
Increasing the order number improves the accuracy. For the DSP data set, although the accuracy increases for order 4, the detection rate decreases at the same time and this makes the detection result imbalanced. Thus, for the DSP data set, orders higher than 2 are unsuitable for deception detection. This may be due to the insufficient amount of training data to justify complex models. For the phishing-ham and scam-ham data sets, higher model orders achieve better results in most cases.
From the results on the scam-ham data set, when a sufficient amount of training data is available, higher-order PPMC achieves better performance. However, higher-order models require more memory and longer processing time. The results show that the processing time for the higher orders is much longer than that of the lower orders. The processing time for email without punctuation is slightly smaller than that of the original email, since NOP reduces the length of the email and the number of items in the model.
The experimental results are presented in Table 3. The detection rate and false positive rate are shown in FIG. Gzip has a very poor result on DSP: it has a very high detection rate at the cost of a high false positive rate. The punctuation in DSP does not play a role in detection. For phishing-ham and scam-ham, the performances of Gzip and RAR are close.
Gzip on the original data achieves the best result. Getting rid of the punctuation degrades the results. One drawback of AMDL is the slow running time. Here we show the running time of testing a single scam email in Table 3. Among the three methods, Bzip2 takes the shortest time, while RAR spends the longest time in compression.
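The AMDL idea is to append the test text to each class corpus and measure the increase in compressed size. As a hedged sketch (not the exact procedure used in the experiments), Python's standard gzip and bz2 modules can stand in for the off-the-shelf compressors:

```python
import bz2
import gzip

def amdl_classify(test_text, class_corpora, algo=bz2):
    """AMDL-style sketch: assign test_text to the class whose training corpus
    shows the smallest increase in compressed size when the text is appended."""
    test = test_text.encode()
    best_label, best_delta = None, None
    for label, corpus in class_corpora.items():
        raw = corpus.encode()
        # Increase in description length caused by appending the test text.
        delta = len(algo.compress(raw + b"\n" + test)) - len(algo.compress(raw))
        if best_delta is None or delta < best_delta:
            best_label, best_delta = label, delta
    return best_label
```

The slow running time noted above follows directly from this structure: every test document requires recompressing each full class corpus.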
For a detection system in which speed is important, AMDL is unsuitable. As noted above, an embodiment of the present disclosure investigates compression-based language models to detect deception in text documents. Compression-based models have some advantages over feature-based methods. PPMC modeling and experimentation at the word level and character level for deception detection indicate that word-based detection results in higher accuracy.
Punctuation plays an important role in deception detection accuracy. Stemming and vocabulary pruning help improve the detection rate for small data sizes. To take advantage of off-the-shelf compression algorithms, an AMDL procedure may be implemented and compared for deception detection. Applicants' experimental results show that word-level PPMC performs better, with a much shorter processing time, on each of the three data sets tested.
Applicants have proposed several methods for deception detection from text data above. This online detection tool can be used by anyone who can access the Internet through a browser or through web services and who wants to detect deceptiveness in any text. On the online tool website, users can type the content or upload the text file they want to test. The user then clicks the validate button, and the cue extraction algorithm and SPRT algorithm, written in Matlab, are called by TurboGears and Python.
After the algorithms are executed, the detection result, trigger cue and deception reason will be shown on the website. If the users are sure about the deceptiveness of the content, they can give the website feedback on the result, which, if accurate, can be used to improve the algorithm based upon actual performance results.
Alternatively, users can indicate that they are not sure, if they do not know whether the content is deceptive or truthful. In accordance with an embodiment of the present disclosure, to implement the SPRT algorithm, the cues' value should be extracted first. To extract the psycho-linguistic cues, most of the time, each word in the text must be compared with each word in the cue dictionary.
This step consumes most of the running time. Applicants noticed that most texts need fewer than 10 cues to determine deceptiveness. In order to make the algorithm more efficient, in accordance with an embodiment of the present disclosure, the following efficient SPRT algorithm may be used. The phishing-ham email data sets are used to obtain the cues' PDFs; the results are shown in Table 4.
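The Applicants' efficient variant is not reproduced here, but a generic SPRT loop over precomputed per-cue log-likelihood ratios illustrates why few cues usually suffice: the cumulative statistic often crosses a decision boundary early, so cue extraction can stop. The boundaries below follow Wald's standard approximations; the error rates are illustrative assumptions:

```python
import math

def sprt(cue_llrs, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test over per-cue log-likelihood
    ratios log(P(cue|deceptive) / P(cue|truthful)).  Returns the decision and
    the number of cues consumed; it stops as soon as a boundary is crossed."""
    upper = math.log((1 - beta) / alpha)   # cross upward -> "deceptive"
    lower = math.log(beta / (1 - alpha))   # cross downward -> "truthful"
    s = 0.0
    for i, llr in enumerate(cue_llrs, start=1):
        s += llr
        if s >= upper:
            return "deceptive", i
        if s <= lower:
            return "truthful", i
    return "undecided", len(cue_llrs)
```

Extracting cues lazily, in decreasing order of discriminating power, and feeding them into such a loop avoids comparing every word in the text against the full cue dictionary in most cases.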
In order to check the validity and accuracy of the proposed algorithms and the online tool, three cases were studied. They related to phishing emails, tracing scams, and web crawls of files from Craigslist. To test Applicants' cue-extraction code, the phishing and ham data set mentioned above may be used. The detection results were measured using fold cross-validation in order to test the generality of the proposed method.
The overall accuracy is the percentage of emails that are classified correctly. It shows that the algorithm worked well on phishing emails. Because no deceptive benchmark data set is publicly available, for the online tool, the phishing and ham emails obtained here were used to obtain the cue values' probability density functions.
A known website, as discussed in , June , features "Thousand dollar bill" emails. These emails promise rewards if you forward an email message to your friends; the rewards include cash from Microsoft, a free computer from IBM, and so on.
The named companies have indicated that these emailed promises are email scams, and they did not send out these kinds of emails. The foregoing website features 35 scam emails. After uploading all 35 scam emails to the Applicants' online tool, 33 of them are detected as deceptive.
Another website, , April , features Scam-o-rama emails. These two cases show that our online tool is applicable for tracing scams. In order to effectively detect hostile content on websites, the deception detection algorithm of an embodiment of the present disclosure is implemented on a system with the architecture shown in FIG. A web crawler program is set to run on public sites such as Craigslist to extract text messages from web pages.
These text messages are then stored in the database to be analyzed for deceptiveness. The text messages from Craigslist are extracted, and the links and hyperlinks are recorded in the set of visited pages.
In experimentally exercising the system of the present disclosure, 62, files were extracted, and the above-described deception detection algorithm was applied to them.
Although the ground truth of these files was unknown, the discovered percentage, or deceptive rate, in Craigslist appears reasonable.
The three data sets described above were combined to develop a training model, and then a fusion rule was applied to the detection results. If both methods detect a text as normal, the result is shown as normal; if either algorithm indicates the text is deceptive, the result is deceptive. Using this method, a higher detection rate may be achieved at the cost of a higher false positive rate. With the rapid development of computer technology, email is one of the most commonly used communication media today.
An enormous volume of messages is exchanged through email each day. Clearly, this presents opportunities for illegitimate purposes. In many misuse cases, the senders attempt to hide their true identities to avoid detection, and the email system is inherently vulnerable to identity hiding. Successful authorship analysis of email misuse can provide empirical evidence in identity tracing and prosecution of an offending user.
Compared with conventional objects of authorship analysis, such as authorship identification in literary works or published articles, authorship analysis of email presents several challenges, as discussed in O. A, August , the disclosure of which is hereby incorporated by reference. First, the short length of the message may cause some identifying features to be absent. Second, the number of potential authors of an email could be large. Third, the number of available emails for each author may be limited, since users often use different usernames on different web channels. Fourth, the composition style may vary depending upon different recipients. Fifth, since emails are more interactive and informal in style, one's writing style may adapt quickly to different correspondents.
However, humans are creatures of habit and certain characteristics such as patterns of vocabulary usage, stylistic and sub-stylistic features will remain relatively constant. This provides the motivation for the authorship analysis of emails. In recent years, authorship analysis has been applied to emails and achieved significant progress. In previous research, a set of stylistic features along with email-specific features were identified and supervised machine learning methods as well as unsupervised machine learning approaches have been investigated.
A, August ; O. de Vel, A. Anderson, M. Corney, and G. Mohay; M. Corney, A. Anderson, G. Mohay, and O. de Vel. From this research, 20 emails with approximately words each were found to be sufficient to discriminate authorship. Computational stylistics has also been considered for electronic-message authorship attribution, and several multiclass algorithms were applied to differentiate authors, as discussed in S.
Argamon, M. Saric, and S. Stein, the disclosure of which is hereby incorporated by reference. Goodman, M. Hahn, M. Marella, C. Ojar, and S. A framework for authorship identification of online messages was developed in R. In this framework, four types of writing-style features (lexical, syntactic, structural, and content-specific) are defined and extracted. Inductive learning algorithms are used to build feature-based classification models to identify the authorship of online messages.
Ceesay, O. Alonso, M. Gertz, and K. Because the authors of phishing emails are unknown and can come from a large number of sources, they proposed methods to cluster the phishing emails into different groups, assuming that emails in the same cluster share some characteristics and are more likely generated by the same author or organization.
The methods they used are k-means clustering (an unsupervised machine learning approach) and hierarchical agglomerative clustering (HAC). A new method based on frequent patterns was proposed for authorship attribution in Internet forensics, as discussed in F. Iqbal, R. Hadjidj, B.
Fung, and M. Debbabi, the disclosure of which is hereby incorporated by reference. Previous work has mostly focused on the authorship identification and characterization tasks, while very limited research has focused on the similarity detection task. Since no class definitions are available beforehand, only unsupervised techniques can be used, such as principal component analysis (PCA) or cluster analysis, as discussed in A. Abbasi and H. Chen. Then an optimal threshold can be compared with the score to determine the authorship. Due to the short length of emails, the large pool of potential authors, and the small number of emails for each author, achieving a high level of accuracy in similarity detection is challenging, if not impossible. They investigated a rich stylistic feature set including lexical, syntactic, structural, content-specific, and idiosyncratic attributes.
They also developed a writeprints technique based on the Karhunen-Loeve transform for identification and similarity detection. In accordance with an embodiment of the present disclosure, Applicants address similarity detection on emails at two levels: identity-level and message-level.
Applicants use a stylistic feature set including features. A new unsupervised detection method based on frequent pattern and machine learning methods is disclosed for identity-level detection.
A baseline method, principal component analysis, is also implemented to compare with the disclosed method. For the message-level, complexity features which measure the distribution of words are first defined. Then, three methods are disclosed for accomplishing similarity detection. Testing that evaluated the effectiveness of the disclosed methods using the Enron email corpus is described below. There is no consensus on a best predefined set of features for differentiating the writing of different identities.
The stylistic features usually fall into four categories: lexical, syntactical, structural, and content-specific, as discussed in R. Lexical features are characteristics of both characters and words.
For instance, letter frequencies, total number of characters per word, word-length distribution, and words per sentence are lexical features. In total, 40 lexical features used in much previous research are adopted. Syntactical features, including punctuation and function words, can capture an author's writing style at the sentence level.
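As a hedged illustration of the lexical category (not the Applicants' 40-feature set), a few of the named features can be computed as:

```python
import re
from collections import Counter

def lexical_features(text, max_len=10):
    """A few of the lexical features named above: average characters per word,
    average words per sentence, and a normalized word-length distribution."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words) or 1
    feats = {
        "avg_chars_per_word": sum(len(w) for w in words) / n,
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
    }
    # Word-length histogram, capped at max_len and normalized by word count.
    lengths = Counter(min(len(w), max_len) for w in words)
    for k in range(1, max_len + 1):
        feats[f"wordlen_{k}"] = lengths[k] / n
    return feats
```

The tokenization regex and the length cap are illustrative assumptions; a production feature extractor would also count letter and character frequencies.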
In many previous authorship analysis studies, one disputed issue in feature selection is how to choose the function words. Due to the varying discriminating power of function words in different applications, there is no standard function word set for authorship analysis.
In accordance with an embodiment of the present disclosure, instead of using function words directly as features, Applicants introduce new syntactical features which compute the frequency of different categories of function words in the text using LIWC. LIWC is a text analysis software program that computes the frequency of words in different categories. Unlike function-word features, the features discerned by LIWC are able to measure the degree to which people use different categories of words.
These kinds of features will help to discriminate the authorship since the choice of such words is a reflection of the life attitude of the author and usually are generated beyond an author's control. Applicants adopted 44 syntactical LIWC features and 32 punctuation features in a feature set.
Combining both LIWC features and punctuation features, there are 76 syntactical features in one embodiment of the present disclosure. Structural features are used to measure the overall layout and organization of text. In A, August , 10 structural features are introduced; here we adopted 9 structural features in our study. Content-specific features are a collection of important keywords and phrases on a certain topic.
It has been shown that content-specific features are important discriminating features for online messages, as discussed in R. For online messages, one user may often send out or post messages involving a relatively small range of topics.
Thus, content-specific features related to specific topics may be helpful in identifying the author of an email. Furthermore, since an online message is more flexible and informal, some users like to use net abbreviations. For this reason, the Applicants have identified the count of the frequency of net abbreviations used in the email as a useful content-specific feature for identification purposes. In accordance with one embodiment of the present disclosure, stylistic features have been compiled as probative of authorship.
The feature set is summarized in Table 5. Because of privacy and ethical considerations, there are not many choices of publicly available email corpora. Enron was an energy company based in Houston, Tex.
Enron went bankrupt in because of accounting fraud. During the process of investigation, the emails of employees were made public by the Federal Energy Regulatory Commission. Here we use the Mar. This version of Enron email corpus contains , emails from users, mostly senior management.
The emails are all plain texts without attachments. Topics involved in the corpus include business communication between employees, personal chats with family, technical reports, etc. From the authorship perspective, we need to be certain of the author of each email; thus the emails in the sent folders were used. Since all users in the email corpus were employees of Enron, the authorship of the emails can be validated by name.
For each email, only the body of the sent content was extracted; email headers, reply text, forwarded content, titles, attachments, and signatures were removed, as were all duplicated or carbon-copied emails. Since ultra-short emails may lack enough information, and emails are commonly not ultra-long, emails with fewer than 30 words were removed.
Also, given the number of emails of each identity needed to detect authorship, only those authors having a certain minimum number of emails were chosen from the Enron email corpus.
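The filtering steps above can be sketched as follows; this is an illustrative sketch that assumes header, reply, and signature stripping has already been done upstream, not the exact preprocessing code:

```python
def preprocess_emails(bodies, min_words=30):
    """Filtering sketch: drop exact duplicates (e.g. carbon copies) and emails
    shorter than min_words, as described for the Enron corpus preparation."""
    seen, kept = set(), []
    for body in bodies:
        text = body.strip()
        if len(text.split()) < min_words:
            continue          # ultra-short email: not enough information
        if text in seen:
            continue          # duplicate / carbon copy
        seen.add(text)
        kept.append(text)
    return kept
```

A per-author filter selecting only identities with a minimum number of surviving emails would then run over the output of this function.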
In accordance with one embodiment of the present disclosure, a new method to detect the authorship similarity at the identity level based on the stylistic feature set is disclosed. As mentioned above, for similarity detection, only unsupervised techniques can be used.
Due to the limited number of emails for each identity, traditional unsupervised techniques, such as PCA or clustering methods, may not be able to achieve high accuracy. Applicants' proposed method, based on established supervised techniques, helps gauge the degree of similarity between two identities.
An intuitive idea of comparing two identities' emails is to capture the writing pattern of two identities and find how much they match. Thus, the first step in Applicants' learning algorithm is called pattern match. By matching the writing pattern of two identities, the similarity between them can be estimated. To define the writing pattern of an identity, we borrow the concept of frequent pattern, as described in R.
Agrawal, T. Imielinski, and A. Swami, developed in the data mining area. Frequent pattern mining has been shown successful in many pattern-recognition applications, such as market basket analysis, drug design, etc. Before describing the frequent pattern, the encoding process used to obtain feature items will first be described. The features extracted from each email are numerical values. To convert them into feature items, Applicants discretize the possible feature values into several intervals according to the interval number v.
Then for each feature value, a feature item can be assigned to it. For example, if the maximum value of feature f 1 could be 1 and the minimum value could be 0, then the feature intervals will be [ Supposing the f 1 value is 0. The 1 in f 12 is the index order of the feature while the 2 is the encoding number.
For a feature whose value is not in [0,1], a reasonable number is chosen as the maximum value. A pattern that contains k feature items is a k-pattern. For the authorship identification problem, the support of F is the percentage of emails that contain F, as in equation 5. Given two identities' emails, and setting the interval number v, the pattern order k, and the minimum support threshold t, the frequent patterns of each identity can be computed. The pattern match then finds how many frequent patterns the two identities have in common, and a similarity score SSCORE is assigned to them as in equation 5.
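The encoding and pattern-match steps can be sketched as follows. The interval encoding, the support computation, and the similarity score are hedged reconstructions (the exact form of equation 5 is not reproduced here), with `encode`, `frequent_patterns`, and `sscore` as hypothetical helper names:

```python
from itertools import combinations

def encode(value, feat_idx, v=5, vmin=0.0, vmax=1.0):
    """Discretize one feature value into one of v intervals; e.g. feature 1
    falling into interval 2 becomes the feature item 'f1_2'."""
    x = min(max(value, vmin), vmax)
    interval = min(int((x - vmin) / (vmax - vmin) * v), v - 1)
    return f"f{feat_idx}_{interval}"

def frequent_patterns(emails_items, k=2, t=0.5):
    """All k-item patterns whose support (the fraction of an identity's emails
    containing the pattern) is at least the threshold t."""
    counts = {}
    for items in emails_items:
        for pat in combinations(sorted(set(items)), k):
            counts[pat] = counts.get(pat, 0) + 1
    n = len(emails_items)
    return {p for p, c in counts.items() if c / n >= t}

def sscore(fp_a, fp_b):
    """Similarity score: share of common frequent patterns between the two
    identities (a guessed normalization, since equation 5 is not shown)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
```

For realistic pattern orders an Apriori-style candidate pruning would replace the brute-force enumeration of combinations.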
In Applicants' example, the number of common frequent patterns is 3. Although different identities may share some similar writing patterns, Applicants propose that emails from the same identity will have more common frequent patterns. Another aspect of Applicants' learning algorithm is style differentiation. In the previous description, the similarity between two identities was considered.
Now, methods of differentiating between different identities will be considered. It has been shown that approximately 20 emails with approximately words in each message are sufficient to discriminate authorship among multiple authors in most cases, as described in M.
To attribute an anonymous email to one of two possible authors, we can expect that the required number of emails from each identity may be fewer than 20 and that the messages can be shorter than words. Since authorship identification using supervised techniques has achieved promising results, an algorithm in accordance with one embodiment of the present invention can be based on this advantage.
In style differentiation, given n emails from author A and n emails from author B, the objective is to assign a difference score between A and B. Consider a randomly picked email from these 2n emails; the identification task is to assign it to either A or B.
However, when A and B are from the same person, even very good identification techniques cannot achieve high accuracy. To assign an email to one of two groups of emails generated by the same person, the result will have an equal chance of showing that the test email belongs to A or B. Therefore, the accuracy of identification will reflect the difference between A and B. This is a motivation for Applicants' proposed style differentiation step.
To better assess the identification accuracy among the 2n emails, leave-one-out cross-validation is used and the average correct classification rate is computed. An algorithm in accordance with one embodiment of the present disclosure can be implemented by the following steps: Step 1: Get two identities, A and B, each with n emails, and extract the feature values. Step 2: Encode the feature values into feature items. Compute the frequent patterns of each identity according to the minimum support threshold t and pattern order k.
Step 3: Compute the correct identification rate R using leave-one-out cross-validation and a machine learning method (e.g., KNN, decision tree, or SVM). After running 2n comparisons, the correct identification rate R is obtained. Step 5: Set a threshold T, and compare S with T. The above method is an unsupervised method, since no training data is needed and no classification information is known a priori. The performance will depend on the number of emails each identity has and the length of each email.
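The style-differentiation idea (identification accuracy as a proxy for authorial difference) can be sketched with a simple leave-one-out loop. This stands in for the KNN, decision tree, and SVM classifiers actually used; `distance` is an assumed feature-space metric:

```python
def loo_identification_rate(emails_a, emails_b, distance):
    """Style-differentiation sketch: hold out each email in turn, assign it to
    the author whose remaining emails are closer on average, and report the
    fraction assigned correctly.  A rate near 1.0 means the two identities
    write differently; near 0.5 means they are hard to tell apart."""
    labeled = [(e, "A") for e in emails_a] + [(e, "B") for e in emails_b]
    correct = 0
    for i, (email, label) in enumerate(labeled):
        rest = [x for j, x in enumerate(labeled) if j != i]
        def avg_dist(lbl):
            ds = [distance(email, e) for e, l in rest if l == lbl]
            return sum(ds) / len(ds)
        pred = "A" if avg_dist("A") <= avg_dist("B") else "B"
        correct += (pred == label)
    return correct / len(labeled)
```

The emails here are assumed to already be feature vectors; an identification rate near chance is what signals that A and B may be the same person.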
KNN, decision trees, and SVM are all well-established and popular machine learning methods. KNN (k-nearest neighbor) classification finds a group of the k objects in the training set that are closest to the test object; the label of the predominant class in this neighborhood is then assigned to the test object.
The KNN classification has three steps to classify an unlabeled object. First, the distance between the test object and all the training objects is computed. Second, the k nearest neighbors are identified.
Third, the class label of the test object is determined by finding the majority label among these nearest neighbors. Decision trees and SVM have been described above. For SVM, several different kernel functions were explored, namely linear, polynomial, and radial basis functions, and the best results were obtained with a linear kernel function, which is defined as K(xi, xj) = xi · xj, the inner product of the two feature vectors.
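The three KNN steps can be sketched directly; this is a generic KNN, with Euclidean distance as an assumed default:

```python
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Generic KNN: (1) compute the distance from the test object to every
    training object, (2) take the k nearest, (3) majority-vote their labels.
    train is a list of (feature_vector, label) pairs."""
    euclid = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    neighbors = sorted(train, key=lambda item: euclid(item[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

An odd k avoids ties in the two-class setting used here.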
To evaluate the performance of the algorithm, PCA is implemented to detect the authorship similarity. PCA is an unsupervised technique which transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components by capturing essential variance across a large number of features.
PCA has been used in previous authorship studies and shown to be effective for online stylometric analysis, as discussed in A. Abbasi and H. Chen. In accordance with one embodiment of the present disclosure, PCA will combine the features and project them into a graph. The geometric distance represents the similarity between two identities' styles. The distance is computed by averaging the pairwise Euclidean distance between the two styles, and an optimal threshold is obtained to classify the similarity.
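The baseline comparison can be sketched as follows, assuming the feature vectors have already been projected onto the principal components; the threshold value is an assumption:

```python
def style_distance(feats_a, feats_b):
    """Average pairwise Euclidean distance between two identities' (assumed
    PCA-projected) feature vectors."""
    euclid = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    total = sum(euclid(a, b) for a in feats_a for b in feats_b)
    return total / (len(feats_a) * len(feats_b))

def same_author(feats_a, feats_b, threshold=0.5):
    """Call the pair 'same author' when the average distance falls below an
    (assumed) optimal threshold."""
    return style_distance(feats_a, feats_b) <= threshold
```

In practice the threshold would be tuned on held-out pairs, exactly as T is tuned for the proposed method.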
Before considering the prediction results, selected evaluation metrics will be defined: recall (R), accuracy, and the F2 measure. Recall R is the fraction of positive pairs that are correctly detected. The accuracy is the percentage of identity pairs that are classified correctly.
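The F2 measure is the standard F-beta score with beta = 2, which weights recall more heavily than precision; a sketch:

```python
def f_beta(recall, precision, beta=2.0):
    """Standard F-beta measure; beta = 2 weights recall twice as heavily as
    precision, matching the F2 metric used in the evaluation."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Favoring recall is sensible here because missing a same-author pair is costlier than a false alarm in similarity detection.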
As mentioned above, only a subset of the Enron emails will be used, viz. For each author, the 2n emails are divided into 2 parts, each part having n emails. In total, there are 2m identities, each with n emails. To test the detection of the same author, there are m pairs. To test the detection of different authors, for each author, one part (n emails) is chosen and compared with the other authors.
There are then. Since the examples in the different-authors case and in the same-author case are not balanced,. The total number of authors m, the number of emails n, and the minimum word count per email (min wc) are varied to see how they influence detection performance. Three methods, KNN, decision tree, and SVM, are used as the basic machine learning method in the style differentiation step.
For the decision tree, Matlab is used to implement the tree algorithm, and the subtrees are pruned. For the SVM, a linear kernel function is used. Because the detection result depends on the choice of the threshold T, different values of T yield different results.
To compare the performance of different methods, for each test, T is chosen to get the highest F 2 value. PCA is also implemented and compared with Applicants' method. Applicants' method outperforms PCA in all the cases. For the proposed method, using SVM and decision tree as the basic method, increasing the number of emails n will improve the performance. Also, increasing the length of the emails will lead to better results. The following tests also use SVM in step 3. To examine the generality of Applicants' method, Applicants compared the detection result using different numbers of authors m and different pattern order k.
As shown in FIG. The pattern order k does not significantly influence the result. Changing the value of k leads to different results, but the variation is small, since a different optimal threshold T is used in each case to achieve the best F2 result.
The detection results with different numbers of authors are similar. Message-level analysis is more difficult than identity-level analysis because usually only a short text is available for each author. The challenge in detecting deception is how to design the detection scheme and how to define the classification features. In accordance with one embodiment of the present disclosure, Applicants describe below the distribution complexity features, which consider the distribution of function words in a text.
Several detection methods pertaining to message-level authorship similarity detection will be described, and the experimental results will be presented and compared. Stylistic cues, which are the normalized frequencies of each type of word in the text, are useful in the similarity-detection task at the identity level. However, using only the stylistic cues, the information concerning the order of words and their positions relative to other words is lost.
For any given author, how are the function words distributed in the text? Are they clustered in one part of the text, or are they distributed randomly throughout? Is the distribution of elements within the text useful in differentiating authorship?
Spracklin, D. Inkpen, and A. In this section, the distribution complexity features will be considered. Since similarity detection at the message-level is difficult, Applicants propose that adding the complexity features provides more information about authorship. Kolmogorov complexity is an effective tool to compute the informative content of a string s without any text analysis, or its degree of randomness; it is denoted K(s) and is the lower bound of all possible compressions of s.
Due to the incomputability of K(s), any lossless compression C(s) can approximate the ideal value K(s). Many such compression programs exist: for example, zip and gzip utilize Lempel-Ziv-based algorithms, and bzip2 uses the Burrows-Wheeler transform and Huffman coding. To measure the distribution complexity of a feature's words, the text is first mapped into a binary string; for example, for the article-words feature, the text is mapped into a binary string containing the distribution information of article words. The complexity is then computed using equation 5.
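The compression-based approximation of K(s) can be sketched as below. This is a minimal illustration using Python's zlib (a DEFLATE compressor) as a stand-in for the compressors mentioned above; the input strings are made up:

```python
# Approximate Kolmogorov complexity K(s) by the length of a lossless
# compression of s relative to its original length: a ratio near 0 means
# highly regular content, a ratio near 1 means near-random content.
import zlib

def complexity(s: bytes) -> float:
    return len(zlib.compress(s, 9)) / len(s)

# A highly regular string compresses far better than a less regular one.
regular = b"01" * 500
```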
In the present problem, nine complexity features will be computed for each email: net-abbreviation complexity, adposition complexity, article complexity, auxiliary-verb complexity, conjunction complexity, interjection complexity, pronoun complexity, verb complexity, and punctuation complexity.
To compute each feature, the text is first mapped into a binary string according to that feature's dictionary; the compression algorithm and equation 5. are then applied to obtain the complexity value. Because no authorship information is known a priori, only unsupervised techniques can be applied in similarity detection. Furthermore, since only one sample is available for each class, traditional unsupervised techniques, such as clustering, are unsuitable for solving the problem.
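The per-feature pipeline above can be sketched as follows. The tiny article dictionary and the use of zlib as the compressor are illustrative assumptions, not the disclosure's exact choices:

```python
# One complexity feature: mark each word of the text with 1 if it is in
# the feature's dictionary and 0 otherwise, then approximate the
# distribution complexity by compressing the resulting binary string.
import zlib

ARTICLES = {"a", "an", "the"}   # hypothetical article-word dictionary

def to_binary_string(text: str, dictionary: set) -> str:
    return "".join("1" if w.lower() in dictionary else "0"
                   for w in text.split())

def distribution_complexity(text: str, dictionary: set) -> float:
    bits = to_binary_string(text, dictionary).encode()
    return len(zlib.compress(bits, 9)) / len(bits)

mask = to_binary_string("the cat sat on a mat", ARTICLES)   # "100010"
```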
Several methods for detecting authorship similarity at the message-level are described below. Given two emails, two cue vectors can be obtained. Applicants inquire whether it is possible to take advantage of these two vectors to determine the similarity of the authorship.
A naive approach is to compare the difference between the two emails, expressed as the distance between the two cue vectors. Since the cues' values are on different scales, they are normalized before computing the distance using equation 5., X_i' = (X_i − X_i^min)/(X_i^max − X_i^min), where X_i is the value of the ith cue and X_i^min and X_i^max are its minimum and maximum values in the data set. After normalization, all cue values lie in [0, 1].
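The min-max normalization described above can be sketched as follows (the input data are illustrative):

```python
# Normalize each cue to [0, 1] via x' = (x - x_min) / (x_max - x_min),
# computed per cue over the whole data set.

def normalize(columns):
    """columns: one list of values per cue."""
    out = []
    for col in columns:
        lo, hi = min(col), max(col)
        out.append([(x - lo) / (hi - lo) if hi > lo else 0.0 for x in col])
    return out
```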
The Euclidean distance in equation 5. is then computed. Usually, when two emails are from the same author, they will share some features. Considering the difference between the two feature vectors, for emails from the same author some variables' differences should be very small, while for different authors the differences may be larger.
This difference is reflected in the distance, so the distance can be used to detect similarity: the Euclidean distance is compared with a threshold to determine authorship. Moreover, since the difference of the two cue vectors reflects the similarity of authorship, treating the difference in each cue as a classification feature allows promising supervised classification methods to be exploited.
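The distance-based detector can be sketched as below; the cue vectors are assumed to be already normalized, and the threshold value is illustrative:

```python
# Naive detector: compute the Euclidean distance between two normalized
# cue vectors and attribute the two emails to the same author when the
# distance falls at or below a threshold t.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def same_author(u, v, t):
    return euclidean(u, v) <= t
```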
For each classification, the difference vector C in equation 5. is used as the feature vector. If many email pairs in the training data are used to compute these classification features, some properties of the features can be learned and used to predict new email pairs.
Applicants propose using two popular classifiers, SVM and decision tree, as the learning algorithms. Unlike the Euclidean-distance method, this supervised classification method requires a training data set to train the classification model. Since the classification feature is the difference between two emails in the data set, the diversity of the data set plays an important role in the classification result.
For example, if the data set only contains emails from two authors, then no matter how many samples are run, the task is merely to differentiate emails between those two authors. In this instance, a good result can be expected.
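The supervised step can be sketched as below. A one-level decision stump is used here as a deliberately minimal stand-in for the SVM and decision-tree classifiers named above, and the training pairs are made-up illustrative data:

```python
# The classification feature is the (absolute) difference between two cue
# vectors; a decision stump learns a single rule of the form
# "difference in cue i <= t  ->  same author" from labeled pairs.

def diff_vector(u, v):
    return [abs(a - b) for a, b in zip(u, v)]

def train_stump(diffs, labels):
    """Pick the (cue index, threshold) pair with the fewest training errors."""
    best = (0, 0.0, len(labels) + 1)
    for i in range(len(diffs[0])):
        for t in sorted(d[i] for d in diffs):
            errs = sum((d[i] <= t) != l for d, l in zip(diffs, labels))
            if errs < best[2]:
                best = (i, t, errs)
    return best[0], best[1]

def predict(stump, d):
    i, t = stump
    return d[i] <= t
```

A full implementation would substitute an SVM or pruned decision tree trained on the same difference vectors.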