Academic work edited
Below is an example of a text I previously edited, used with permission of the author. The corrections and adjustments are tracked using “track changes” in Microsoft Word. I sent the following note to the author:
Hi xxxxxxx, second revised version attached. Notes on this revision:
• I have consistently used the hyphenated term 'section-based' rather than 'sections based', and have also consistently hyphenated DBSCAN-based and CPA-based.
• I have consistently used 'automated' as opposed to 'automatic' evaluations, and sometimes 'automated instrument'.
• I have consistently italicized dataset_1 and dataset_2, and never referred to them as 'the dataset_1'.
• In 3.3.1 I wasn't sure if you meant 'number of minimum points' or 'minimum number of points'. I have left the original in place.
• I suggest using simply 'standard dataset' instead of 'gold standard dataset', but have not changed it as yet in case 'gold standard' is industry jargon.
• I note that you speak of Figure 2.1 and Table 2.1. I'm not sure if these refer to the same thing, but if they do it would be good to use the same word.
Regards,
Chris Kerwick
Chapter 1
The immense proliferation of research papers in journals and conferences has posed challenges for researchers wanting to easily access relevant scholarly papers. Recommender systems offer a solution to this dilemma by filtering all of the available information and delivering what is most relevant to the user.
Several approaches have been proposed for research paper recommendation, including approaches based on metadata, content, citation analysis and collaborative filtering. Approaches predicated on citation analysis, including co-citation analysis and bibliographic coupling, have proven to be significant. Co-citation has been analyzed at content level, and the use of citation proximity analysis has shown significant improvement in accuracy. However, co-citation presents the relationship between two papers based on their having been mutually cited by other papers, without considering the contents of the citing papers. Bibliographic coupling, on the other hand, considers two papers as relevant if they share common references; however, traditional bibliographic coupling does not consider the citing patterns of common references in different logical parts of the citing papers.
The improvement found in cases of co-citation when combined with content analysis motivated us to analyze the impact of using the proximity analysis of in-text citations in cases of bibliographic coupling. Therefore, in this research, three different approaches were proposed that extend bibliographic coupling by exploiting the proximity of in-text citations of bibliographically coupled articles. These approaches are: (1) DBSCAN-based bibliographic coupling, (2) centile-based bibliographic coupling and (3) section-based bibliographic coupling. Comprehensive experiments utilizing both a user study and automated evaluations were conducted to evaluate the proposed approaches. The results showed that the proposed approaches recorded significant improvement over traditional bibliographic coupling and content-based research paper recommendation approaches.
Chapter 3 - Methodology
As mentioned above, we have undertaken to examine three approaches for research paper recommendation: (1) the DBSCAN-based approach, (2) the CPA-based approach and (3) the section-based approach. These approaches are discussed in chapters 4, 5 and 6 respectively. Before going into the details of each approach, we want to give an overview to the reader of what we have done, why we have proceeded in this direction and how these approaches came into existence. To avoid repetition, the common steps required to understand the flow in chapters 4, 5 and 6 are explained once in this chapter. It is therefore advisable to read this chapter before going on to the subsequent chapters.
As highlighted in the previous chapters, bibliographic coupling presents a relationship between two papers based on their common references, whereas co-citation presents such relationships between two papers based on their being mutually cited in other papers. Furthermore, co-citation has been further studied at content level, while bibliographic coupling has not been studied at the content level. This research attempts to evaluate bibliographic coupling at content level. In our efforts to provide such a comprehensive evaluation we examined the literature and found that co-citation was extended with respect to content in an approach known as Citation Proximity Analysis, or CPA (Gipp et al., 2009). We have therefore implemented the same approach for bibliographic coupling. Furthermore, we have proposed two other new approaches based on the contents of bibliographically coupled papers. The three approaches are outlined below:
1. CPA-based in-text citation proximity analysis
2. DBSCAN clustering of in-text citations
3. Section-based in-text citation proximity analysis
Figure 2.1 shows the overall flow of the research, highlighting its different steps.
In the first step, we gathered the datasets for our experiments. Using a crawler, we amassed two datasets of different sizes from CiteSeerx. We performed initial experiments on the smaller dataset, and then performed comprehensive experiments on the larger dataset. The details of these datasets and how they were gathered are provided later in this chapter.
The first approach we used was DBSCAN-based. Since the lengths of scientific papers can vary, some being perhaps 15,000 to 20,000 words long and others 3,000 to 5,000 words long, comparing the proximity of citations using the position of citations may lead to incorrect results. To compensate for this, in the DBSCAN-based approach we normalized the values of citation positions within the full text of documents. For our research, we used Min-Max Normalization. This algorithm performs a linear transformation on the data values. Min-Max normalization transforms the original interval [MinX, MaxX] for a feature X into a new interval [New_MinX, New_MaxX]. Each value, v, in the original interval is converted into a new value, New_v, using the following formula:
New_v = ((v − MinX) / (MaxX − MinX)) × (New_MaxX − New_MinX) + New_MinX        (Eq. 1)
For our research, v represents the location of the citations.
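To make Eq. 1 concrete, here is a minimal Python sketch of this normalization; the positions, document length and target range below are illustrative values, not figures from the thesis.

```python
def min_max_normalize(v, min_x, max_x, new_min=0.0, new_max=1000.0):
    """Linearly map a citation position v from [min_x, max_x] to [new_min, new_max] (Eq. 1)."""
    return (v - min_x) / (max_x - min_x) * (new_max - new_min) + new_min

# Citation positions (word offsets) in a 15,000-word paper, rescaled to a common 0-1000 range.
positions = [120, 4500, 14980]
print([min_max_normalize(p, min_x=0, max_x=15000) for p in positions])
# -> [8.0, 300.0, 998.66...]
```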
In our proposed algorithm, we used a density-based clustering approach called DBSCAN to discover clusters of citations. DBSCAN discovers clusters based on the density of the items in the item set. The two parameters used in DBSCAN are epsilon (ε) and minPts. This approach is discussed in full detail in Chapter 4.
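As an illustration only, the following sketch clusters normalized citation positions with scikit-learn's DBSCAN; the eps and min_samples values are placeholders, not the parameters used in the thesis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Normalized in-text citation positions for one citing paper (illustrative values).
positions = np.array([[12.0], [15.0], [18.0], [410.0], [415.0], [900.0]])

# eps corresponds to the radius (epsilon) and min_samples to minPts.
labels = DBSCAN(eps=10.0, min_samples=2).fit_predict(positions)
print(labels)  # e.g. [0 0 0 1 1 -1]; -1 marks a citation treated as noise
```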
In our second proposed approach, we used Citation Proximity Analysis (CPA) in bibliographic coupling for recommending papers. This approach extends traditional bibliographic coupling by also integrating the proximities of in-text citations. It mines the patterns of in-text citations in bibliographically coupled papers and recognizes clusters based on their normalized proximity using centile positions. We discuss this approach in detail in Chapter 5.
Our third proposed approach is based on the intuition that authors cite certain papers in particular sections for certain reasons. Citations from different sections have different weights in determining the similarity between documents. In this approach we analyzed the in-text citations from different sections of various research papers and tried to discover whether the existence of an in-text citation in a particular section has any impact on the accuracy of paper recommendations. We discuss this approach in Chapter 6.
After performing the experiments, we evaluated our proposed approaches and compared their accuracies with the traditional bibliographic coupling and content-based approaches. We evaluated our approaches in two steps:
1. User Study
2. Automated Approach
For the user study, we used a gold standard dataset that consisted of 320 bibliographically coupled pairs. Every paper was evaluated by two individual users. For each paper, the inter-rater agreement between the users was calculated using Spearman’s correlation coefficient. The ranking agreed by both users provided the gold standard ranking. Using this gold standard dataset, the proposed approach was compared with the bibliographic coupling approach and the content-based approach, again using Spearman’s correlation coefficient. The comparison showed that our proposed approaches performed better than the traditional bibliographic coupling and content-based approaches. We discuss the evaluation and the results in Chapter 7.
We also used an automated instrument to evaluate the performance of our proposed approaches. For this we made use of the Jensen-Shannon Divergence (JSD). JSD finds the distance between two probability distributions. In the case of research papers, the word distribution of an individual research paper formed one probability distribution and the word distribution of the entire cluster formed the second probability distribution. The results of this automated instrument also suggested that our proposed approaches provided greater accuracy than the existing approaches. We discuss this process and the results in detail in Chapter 7.
3.1 Dataset Selection
A comprehensive dataset was required in order to evaluate our proposed approaches thoroughly. There are many digital libraries and online resources that offer datasets. For example, PubMed provides access to almost 27 million citations for biomedical literature. Similarly, Scopus is another resource that contains a huge repository of research papers. However, few of these repositories provide free access to their datasets; users have to pay for them. Another issue with some of these repositories is that it is challenging to correctly extract the references from the papers, which makes the process of downloading bibliographically coupled papers complicated.
For this study, we used a digital library called CiteSeer to gather our dataset. CiteSeer is a huge repository with around 2 million publications indexed. CiteSeer provides access to the metadata (author’s name, venue and year of publication, etc.) and the full texts of research papers. Researchers have used CiteSeer data in the past for various tasks, including text classification, collective classification and citation recommendation (Wang et al., 2016). There are two main reasons for using this digital library. The first is that it provides free access to the datasets, which can also be accessed in many different ways. The second is that it retains all the cited papers in a special table, and the citing articles can be linked to them using a key attribute, CID. In other words, CiteSeer simplifies the process of downloading datasets of bibliographically coupled papers.
We developed a focused crawler to download two different datasets. We used the first dataset for initial experiments, and then used the second for the more extensive and comprehensive experiments. We called them dataset_1 and dataset_2. Initially, we collected dataset_1, containing 320 bibliographically coupled papers. Later, we collected the larger dataset_2, containing 5,000 bibliographically coupled papers from different domains.
We used the 17 queries outlined in Table 2.1 to collect dataset_2. These queries were chosen in order to provide a comprehensive and diversified dataset.
Dataset_1 consisted of 320 bibliographically coupled papers which were divided into 32 subsets. Each subset consisted of 10 papers that were bibliographically coupled based on a certain query paper. Dataset_2 was divided into 226 subsets. These subsets were generated based on the combination of the search query used and the cited-paper-id. These subsets were later combined into 17 groups, each representing a query.
3.2 Content Extraction
These datasets contained the research papers in PDF format. While these research papers can provide useful information in PDF format, in order to fetch certain other important aspects of their contents we needed to convert them into XML. We converted these papers into XML using an online tool called PDFx. PDFx is a specialized tool for the conversion of research papers from PDF to XML format (Constantin et al., 2013), and is useful for converting PDF files to XML in bulk.
The XML files contained certain important XML elements, the most important of which are the section, ref and xref elements. The xref element with the attribute ref-type="bibr" represents the in-text citations and can be linked to the corresponding ref elements through the attribute rid. This rid attribute proves very helpful in counting the frequencies of in-text citations within sections.
The section element refers to the sections inside the research paper. This element contains a nested heading element denoted by h1, which refers to the heading of each section. PDFx provides two further levels of headings, namely h2 and h3. In our section-based approach, we used the Document Object Model (DOM) to traverse the XML files and fetch the section headings.
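The following sketch shows one way this extraction could be done with Python's standard library, assuming the element and attribute names described above; the actual PDFx output may nest these elements differently (and may use namespaces), so this is illustrative rather than the exact extraction code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("paper.xml")  # a PDFx output file
root = tree.getroot()

# Top-level section headings: h1 elements nested inside section elements.
headings = [h1.text.strip() for h1 in root.iter("h1") if h1.text]

# Count in-text citations per referenced item via xref[@ref-type="bibr"] and its rid.
citation_counts = Counter(
    xref.get("rid")
    for xref in root.iter("xref")
    if xref.get("ref-type") == "bibr"
)

print(headings)
print(citation_counts.most_common(5))
```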
Both of these datasets were stored in an SQL database. The information stored in the database includes the metadata of the research papers (author name, venue and year of publication, etc.), the DOIs of all the cited papers, the DOIs of all the citing papers, the positions of in-text citations in all the citing papers, the section headings of all the citing papers and the centiles to which the in-text citations belong.
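For illustration, a simplified schema of this kind could look as follows in SQLite; the table and column names are hypothetical and not the ones used in the thesis.

```python
import sqlite3

conn = sqlite3.connect("citations.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS paper (
    doi     TEXT PRIMARY KEY,
    authors TEXT,
    venue   TEXT,
    year    INTEGER
);
CREATE TABLE IF NOT EXISTS in_text_citation (
    citing_doi      TEXT,    -- DOI of the citing paper
    cited_doi       TEXT,    -- DOI of the cited paper
    position        INTEGER, -- position of the in-text citation in the full text
    centile         INTEGER, -- centile to which the in-text citation belongs
    section_heading TEXT     -- section heading containing the citation
);
""")
conn.commit()
```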
3.3 Proposed Approaches
We proposed three approaches:
1. DBSCAN-based approach,
2. CPA-based approach and
3. Section-based approach.
In the following sub-sections, we will explain all of these approaches.
3.3.1 DBSCAN-based Approach:
The first approach that we proposed for recommending research papers uses a density-based clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Our examination of the literature showed that, while researchers had found that the use of citation proximity analysis for co-citation can help improve accuracy, the impact of using the proximities of in-text citations in bibliographic coupling has not been analyzed extensively in the past. Therefore, we decided to analyze the impact of using proximity analysis and the positions of in-text citations in the full texts in cases of bibliographic coupling, in the field of research paper recommendation.
We first extracted all the in-text citations from the bibliographically coupled papers. Next, we found the proximities of all the in-text citations. To compensate for the varying lengths of papers, we normalized the proximities of the in-text citations using min-max normalization.
In the traditional DBSCAN algorithm, the clusters are formed using two parameters: ε and minPts. ε represents the radius and minPts represents the minimum number of points required within ε. The values of these parameters are provided as input. In our case, we performed an extensive experiment on dataset_1 to determine the value of ε that produced the most accurate recommendations. We found that the best value of ε for the sake of accuracy was 150. Later, we used this value of ε on dataset_2.
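A sketch of the kind of parameter sweep involved is shown below; score_for() is a hypothetical stand-in for the accuracy measurement against the gold-standard rankings, and the scores are purely illustrative.

```python
def score_for(eps):
    """Placeholder for the accuracy obtained with a given eps on dataset_1 (illustrative values)."""
    illustrative_scores = {50: 0.41, 100: 0.52, 150: 0.61, 200: 0.57, 250: 0.49}
    return illustrative_scores[eps]

best_eps = max([50, 100, 150, 200, 250], key=score_for)
print(best_eps)  # -> 150 with these illustrative scores
```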
We discuss the details of this approach, and the details of its evaluation and comparison with other approaches, in Chapter 4.
3.3.2 CPA-based Approach:
As we mentioned in the previous chapter, more than 55% of approaches to research paper recommendation utilize the content of papers to recommend research papers. Co-citation is one of the oldest citation-based approaches in this regard. Researchers have performed content analysis on co-citation and have found improvements in accuracy (Boyack et al., 2013; Gipp et al., 2009). Gipp et al. proposed a CPA-based approach for co-citation, and their results showed that the relevance between two papers is higher when a common citing paper cites them from the same sentence. The relevance decreases when the material cited is from the same paragraph rather than the same sentence. However, according to Boyack et al. (2013), the reference positions in the full text can be specified without the sentence, paragraph and section demarcations. In both studies, the results showed improved accuracy in paper recommendation as compared to simple co-citation.
The improvement in accuracy associated with using the centile locations of in-text citations in co-citation analysis motivated us to explore in-text citation occurrences, proximities and patterns in bibliographic coupling. The proposed approach clusters the in-text citations based on their centile positions.
Initially we used dataset_1 for this approach too. First, we found the positions of the in-text citations. Then, we calculated the centile location of each in-text citation. Next, the distances between the centile values of all the in-text citation pairs were calculated. These values were stored in the database and used by five different citation proximity schemes that cluster the citation percentile values using different thresholds. We used two weighting schemes which were proposed by Boyack et al. (2013), and also used dataset_1 in order to propose and test three new weighting schemes for our proposed approach.
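A minimal sketch of the centile calculation and a threshold-style weighting is given below; the positions, thresholds and weights are illustrative, not the five schemes evaluated in the thesis.

```python
import math

def centile(position, doc_length):
    """Map an in-text citation position to its centile (1-100) within the full text."""
    return min(100, math.ceil(position / doc_length * 100))

# Two in-text citations of bibliographically coupled references in a 12,000-word paper.
c1 = centile(1800, 12000)   # 15
c2 = centile(2050, 12000)   # 18
distance = abs(c1 - c2)     # 3

# One possible proximity weighting based on the centile distance (illustrative thresholds).
weight = 1.0 if distance == 0 else 0.5 if distance <= 5 else 0.25
print(c1, c2, distance, weight)
```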
We discuss the details of this approach and these weighting schemes in Chapter 5.
3.3.3 Section-based Approach:
The way in-text citations are distributed in the full text of a research paper varies from author to author. The way authors place citations is subjectively decided. However, studies (Cronin, 1984; Small, 1976) suggest that authors generally follow a certain set of procedural standards when referencing other papers. Another study (Ding et al., 2013) highlights the fact that authors normally tend to prefer certain sections over others when distributing the in-text citations. According to this study, citations are most common in literature review sections, followed by methodology sections.
This raised our interest in exploring section-based bibliographic coupling for paper recommendation. In this approach, we used dataset_2 to fetch the sections from all the citing papers. In the next step, we mapped these sections to a set of generic sections that were determined using previous studies (Golshan et al., 2012; Hengl et al., 2013). Then we assigned weights to the in-text citations from all sections. The literature shows that in-text citations from the methodology and results sections are given more weight than those from the introduction sections, and in-text citations from the related work section carry the least weight (Teufel et al., 2009; Sugiyama et al., 2013).
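An illustrative sketch of this mapping and weighting is shown below; the generic sections and weight values are placeholders standing in for the ones described in Chapter 6.

```python
# Map raw section headings to generic sections (illustrative subset).
GENERIC_SECTION = {
    "introduction": "introduction",
    "background": "related work",
    "related work": "related work",
    "literature review": "related work",
    "methods": "methodology",
    "methodology": "methodology",
    "experiments": "results",
    "results": "results",
}

# Illustrative weights: methodology/results highest, related work lowest.
SECTION_WEIGHT = {"methodology": 1.0, "results": 1.0, "introduction": 0.5, "related work": 0.25}

def citation_weight(raw_heading):
    generic = GENERIC_SECTION.get(raw_heading.strip().lower(), "introduction")
    return SECTION_WEIGHT[generic]

print(citation_weight("Methods"))            # 1.0
print(citation_weight("Literature Review"))  # 0.25
```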
The details of this approach are discussed further in Chapter 6.
3.4 Evaluation
There are three main methods of evaluating research paper recommendation systems (Beel et al., 2013): (1) user studies, (2) online evaluation and (3) offline evaluation. Beel et al. reviewed 176 papers and found that 69% of the approaches used offline evaluation, 34% used user studies and 7% used online evaluation. In the case of offline evaluation, 29% of the approaches used the CiteSeer data.
In order to evaluate our proposed approaches, we needed to use a benchmark dataset. Unfortunately, there is no such benchmark dataset available for evaluating approaches to research paper recommendation. Nor is there a standard method for evaluating research paper recommendation systems. However, many researchers have used user studies to evaluate their research paper recommendation approaches (Beel et al., 2013; Lee et al., 2008). Therefore, we also used a user study to evaluate one of our proposed approaches (the DBSCAN-based approach). For this purpose, we used our smaller dataset consisting of 320 bibliographically coupled papers and carried out a user study in which each paper was evaluated and manually ranked by two users. Later, we used the Spearman coefficient to measure the inter-rater agreement between the two users for each paper.
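For illustration, the inter-rater agreement for one subset could be computed as follows; the rankings shown are invented, not actual user-study data.

```python
from scipy.stats import spearmanr

# Rankings assigned by the two users to the same ten recommended papers (illustrative).
user_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
user_2 = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9]

rho, p_value = spearmanr(user_1, user_2)
print(rho)  # Spearman coefficient as the inter-rater agreement for this subset
```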
A user study is an effective way of evaluating research paper recommendation systems, but it suffers from the problem of scalability. A user study can be conducted only with smaller datasets. It can become very costly where larger datasets are used and is not the best option for comprehensive evaluation.
In order to overcome the limitations of user study evaluation, we used offline evaluations and employed an automated instrument for evaluating all of our proposed approaches. For this purpose, we used the Jensen-Shannon Divergence (JSD), which measures the similarity between two probability distributions. It is based on the Kullback-Leibler divergence. In the case of research papers, the word distribution of an individual research paper forms one probability distribution and the word distribution of the entire cluster forms the second probability distribution.
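A minimal sketch of the JSD computation is given below, expressing it in terms of the Kullback-Leibler divergence; the two word distributions are invented for illustration, and log base 2 is assumed.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):  # Kullback-Leibler divergence KL(a || b)
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Word distribution of one paper vs. the word distribution of the whole cluster (illustrative).
paper_dist = [0.10, 0.30, 0.25, 0.35]
cluster_dist = [0.15, 0.25, 0.30, 0.30]
print(jsd(paper_dist, cluster_dist))  # small value -> the paper's vocabulary fits the cluster
```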
Using this automated instrument, we were able to evaluate the three proposed approaches extensively. We discuss the details of the evaluation of each approach in its respective chapter.