Figure 1 shows the different components of our framework.
In “Component 1”, we build for each user a profile based only on the hashtags she/he has cited. This profile is complementary to the FOAF profile. By constructing a hashtag-based profile, we mean that the different significant tokens composing a hashtag are extracted and added to the user’s profile (cf. section 4). “Component 2” computes the semantic similarity between profiles and produces a similarity matrix, where each element is a measure of similarity between two profiles (cf. section 5). In “Component 3”, we apply a clustering algorithm to produce a set of clusters, each containing a set of semantically related profiles (cf. section 6). These clusters form the basis of our recommender system: each user can be recommended potential relationships drawn from the cluster she/he belongs to.
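For concreteness, the pipeline formed by these three components can be sketched in Python as follows. This is only a minimal, self-contained illustration: the token extraction, similarity measure, and clustering used here are trivial stand-ins (whole-tag tokens, Jaccard overlap, single-link threshold grouping), not the techniques developed in sections 4, 5, and 6:

from itertools import combinations

def build_profile(hashtags):
    # Component 1: the profile is the set of significant tokens extracted
    # from the user's hashtags; whole tags stand in for tokens here, since
    # the real segmentation is presented in section 4.
    return {tag.lstrip("#").lower() for tag in hashtags}

def similarity(p1, p2):
    # Component 2: stand-in Jaccard measure between two profiles;
    # section 5 defines the semantic measure actually used.
    return len(p1 & p2) / len(p1 | p2) if p1 or p2 else 0.0

def clusters(profiles, threshold=0.3):
    # Component 3: naive single-link grouping over the similarity matrix;
    # section 6 applies a proper clustering algorithm instead.
    groups = [{user} for user in profiles]
    for a, b in combinations(profiles, 2):
        if similarity(profiles[a], profiles[b]) >= threshold:
            ga = next(g for g in groups if a in g)
            gb = next(g for g in groups if b in g)
            if ga is not gb:
                ga |= gb
                groups.remove(gb)
    return groups

# Each user is then recommended the other members of her/his cluster.
profiles = {"u1": build_profile(["#worldwide", "#festival"]),
            "u2": build_profile(["#festival", "#toobad"]),
            "u3": build_profile(["#ff"])}
print(clusters(profiles))  # e.g. [{'u1', 'u2'}, {'u3'}]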
4. Hashtag Segmentation
A common practice in current social networks is to identify
the subjects of a post by means of hashtags, e.g.,
#Manchester, #LiesPeopleAlwaysTell, #toobad, #ff,
#skypeisnotworkingagain [6].
As defined above, a hashtag is a word or an un-spaced phrase prefixed with the hash character. Hence, a hashtag can be made up of one, two, or more words. To exploit a hashtag, it must be decomposed into its constituent words. The more words a hashtag contains, the more complex it becomes and the harder it is to segment into its exact constituent words. For instance, the hashtag #dependentrelationship can be split as dependent relations hip, as dependent relationship, or as dependent relation ship. How do we decide which segmentation is right, or at least the most likely? The same problem arises with the hashtag #airportend, which can be split as air portend, as airport end, or as air port end.
In our work, we developed a segmentation algorithm that proceeds in two main steps:
1. The first step uses an English lexicon to find all the
possible sequences of words that may compose a
hashtag. For example, the hashtag
#throwbackthursday has two lexically correct
sequences:
throwback thursday
throw back thursday
This lexical step eliminates any segmentation containing invalid words, i.e. words not found in the dictionary. To accomplish this step, we used the English Lexicon Project compiled at Washington University (elexicon.wustl.edu), which consists of about 80,000 words [7]. Note that sometimes the hashtag itself is a valid dictionary word and is added as a possible segmentation; for example, the hashtag #worldwide has two possible segmentations according to the dictionary: world wide and worldwide. In such a case, we choose the single word as the right segmentation.
2. If at least two possible segmentations arise from the first step, we proceed with a disambiguation step in order to find the most probable sequence of words. We developed a probabilistic model based on bigram frequencies. Note that an n-gram is a contiguous sequence of n items from a given sequence of text or speech [8]; the items can be phonemes, syllables, letters, or words according to the application. In our context, we consider word items, and an n-gram of size 2 (n=2) is a bigram. Several corpora exist that provide bigram frequency counts. We used the bigram list provided by the Corpus of Contemporary American English (COCA, http://www.ngrams.info/). For each bigram in this list, we computed its probability, i.e. how likely this bigram is to appear in an English sentence.
To find the most probable segmentation of a hashtag, we consider each generated segmentation as a path in a Markov model, and we select the segmentation with the highest path probability, i.e. the highest product of bigram probabilities along the path.
To illustrate this step, consider the hashtag #worldwidefestival. The lexical segmentation step produces the following possibilities:
worldwide festival
world wide festival
The segmentation worldwide festival has a bigram probability of 0.0022. The probability of the segmentation world wide festival is the probability of the bigram world wide multiplied by the probability of the bigram wide festival, i.e. 0.05 x 0.0099 ≈ 0.00049. Hence, the segmentation worldwide festival is selected. A code sketch of both steps is given below.
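The following Python sketch puts the two steps together. The small LEXICON and BIGRAM_P tables are toy stand-ins for the English Lexicon Project and the COCA bigram list (the values are those of the example above); unseen bigrams simply receive probability 0 here, whereas a full implementation would load the complete tables and possibly apply smoothing:

LEXICON = {"world", "wide", "worldwide", "festival"}
BIGRAM_P = {("worldwide", "festival"): 0.0022,
            ("world", "wide"): 0.05,
            ("wide", "festival"): 0.0099}

def candidate_segmentations(text):
    # Step 1 (lexical): enumerate every way of splitting the text
    # into a sequence of dictionary words.
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in LEXICON:
            for tail in candidate_segmentations(text[i:]):
                results.append([head] + tail)
    return results

def path_probability(words):
    # Step 2 (disambiguation): product of the bigram probabilities
    # along the segmentation's path in the Markov model.
    p = 1.0
    for bigram in zip(words, words[1:]):
        p *= BIGRAM_P.get(bigram, 0.0)
    return p

def segment(hashtag):
    text = hashtag.lstrip("#").lower()
    if text in LEXICON:
        return [text]  # the hashtag itself is a valid word: keep it whole
    candidates = candidate_segmentations(text)
    if not candidates:
        return [text]  # lexical step failed: keep the raw hashtag
    return max(candidates, key=path_probability)

print(segment("#worldwidefestival"))  # -> ['worldwide', 'festival']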
To evaluate the hashtag segmentation algorithm, we selected the top 387 hashtags trending on social networks in January 2015. We performed an offline segmentation, reaching a 97.9% success rate; that is, only 8 hashtags were not correctly segmented. Looking at the failures in detail, we noticed that for 3 hashtags the corresponding bigrams were not found in the COCA corpus; in the other 5 cases, the lexical step failed because the hashtag words were not found in the English dictionary we used.
5. Profiles Matching
Given the set of hashtags cited by a user, we are now able to derive her/his profile, consisting of the different significant words composing these hashtags (cf. section 4). In this section, we present our profile matching algorithm, which determines whether or not any two profiles share common topics of interest. The algorithm we propose is a generic matching algorithm that measures the semantic similarity between any two profiles. It is generic because it is designed to measure the similarity between any two sets of words, not necessarily users’ profiles. Such an algorithm could therefore be used to extend our framework to images and videos.
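The concrete measure we use is presented in the rest of this section; as a generic illustration only, the following sketch shows the shape such a matcher can take. It assumes some word-level semantic similarity word_sim() with values in [0, 1] (e.g. a WordNet- or embedding-based measure); the character-overlap stand-in used here is purely illustrative and is not the paper's measure:

def word_sim(w1, w2):
    # Illustrative stand-in for a word-level semantic similarity:
    # exact match, otherwise normalized character overlap.
    if w1 == w2:
        return 1.0
    s1, s2 = set(w1), set(w2)
    return len(s1 & s2) / max(len(s1), len(s2))

def profile_similarity(p1, p2):
    # Generic matcher over two sets of words: pair each word of one set
    # with its best match in the other, then average both directions so
    # that the measure is symmetric.
    if not p1 or not p2:
        return 0.0
    def directed(src, dst):
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)
    return (directed(p1, p2) + directed(p2, p1)) / 2

# The matcher applies to any two sets of words, not only user profiles.
print(profile_similarity({"world", "festival"}, {"festival", "music"}))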