A Graph-based Dataset of Commit History of Real-World Android Apps
Obtaining a good dataset to conduct empirical studies on the engineering of Android apps is an open challenge. To start tackling this challenge, we present AndroidTimeMachine, the first self-contained, publicly available dataset weaving together spread-out data sources about real-world, open-source Android apps. Encoded as a graph-based database, AndroidTimeMachine concerns 8,431 real open-source Android apps and contains: (i) metadata about the apps’ GitHub projects, (ii) Git repositories with full commit history, and (iii) metadata extracted from the Google Play store, such as app ratings and permissions.
Since mobile apps differ from traditional software and require tackling new problems (e.g., power management and privacy protection [5,7,15,16]), researchers are conducting empirical studies, especially by mining software repositories, to understand and support mobile software development.
As an example of recent research on apps, Malavolta et al. analyzed more than 11,000 apps published in the Google Play store and investigated the end users’ perceptions about various hybrid development frameworks [12]. Also, Linares-Vásquez et al. mined 54 Android apps from the Google Play store to find programming practices that may lead to excessive energy consumption [5].
A common challenge when investigating apps is accessing candidate subjects (i.e., the app binaries or source code). A widely adopted approach is to gather information from open-source software (OSS) marketplaces such as F-Droid [4,9,13]. Nevertheless, relying on F-Droid limits the number of projects that can be considered, as it only contains metadata of 2,697 apps. Moreover, for every study, researchers have to (i) systematically explore several online repositories to find analyzable apps, (ii) filter out source code not intended for the Android platform, and (iii) verify apps’ consistency within official distribution channels.
To improve this situation, we propose AndroidTimeMachine, a graph-based dataset with data linked from different sources concerning the development and publication process of 8,431 OSS Android apps. We combine information from GitHub and Google Play to create a unified dataset including (i) metadata of GitHub projects, (ii) full commit and code history, and (iii) metadata from the Google Play store. This dataset is the largest collection of published OSS Android apps with linked source code and store metadata that we know of. The connected nature of this dataset and the included revision history allow a holistic view on OSS Android apps from development to publication on Google Play.
AndroidTimeMachine is composed of two main parts: a graph-based database (which facilitates understanding and navigation by focusing on links between apps, repositories, commits, and contributors) and a Git server hosting a mirror of all 8,431 GitHub repositories (thus providing a self-contained snapshot of the apps within the dataset). AndroidTimeMachine is publicly accessible at http://androidtimemachine.github.io and is available as a Docker container image, which runs an instance of a Neo4j database with all the metadata and a GitLab server hosting all the mirrored GitHub repositories.
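To make the idea of navigating links between apps, repositories, commits, and contributors more concrete, the following minimal Python sketch models such a graph in memory. It is an illustration only: the node identifiers, the edge-type names (`IMPLEMENTED_BY`, `CONTAINS_COMMIT`, `AUTHORED_BY`), and the sample data are hypothetical and are not taken from the actual AndroidTimeMachine schema.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph: labeled nodes and typed, directed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> (label, properties)
        self.edges = defaultdict(list)  # (source id, edge type) -> [target ids]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, source, edge_type, target):
        self.edges[(source, edge_type)].append(target)

    def neighbors(self, source, edge_type):
        # .get avoids creating empty entries for unknown keys
        return self.edges.get((source, edge_type), [])


def contributors_of_app(graph, app_id):
    """Follow app -> repository -> commit -> contributor; return distinct names."""
    names = set()
    for repo in graph.neighbors(app_id, "IMPLEMENTED_BY"):
        for commit in graph.neighbors(repo, "CONTAINS_COMMIT"):
            for person in graph.neighbors(commit, "AUTHORED_BY"):
                names.add(graph.nodes[person][1]["name"])
    return sorted(names)


# Hypothetical sample data (not from the dataset).
g = Graph()
g.add_node("app:com.example.notes", "App", rating=4.2)
g.add_node("repo:example/notes", "Repository")
g.add_node("commit:a1", "Commit")
g.add_node("commit:b2", "Commit")
g.add_node("dev:alice", "Contributor", name="alice")
g.add_node("dev:bob", "Contributor", name="bob")

g.add_edge("app:com.example.notes", "IMPLEMENTED_BY", "repo:example/notes")
g.add_edge("repo:example/notes", "CONTAINS_COMMIT", "commit:a1")
g.add_edge("repo:example/notes", "CONTAINS_COMMIT", "commit:b2")
g.add_edge("commit:a1", "AUTHORED_BY", "dev:alice")
g.add_edge("commit:b2", "AUTHORED_BY", "dev:bob")

app_contributors = contributors_of_app(g, "app:com.example.notes")
```

In a graph database, the same three-hop traversal would typically be written as a single path query; the sketch only shows why storing links explicitly makes such questions easy to ask.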
We only considered applications available in the Google Play store. This limitation is mitigated by the fact that Google Play is the official Android app store and offers the largest selection of Android apps [1]. We mined Google Play from a server in our region, thus limiting the data collection to the apps available here.
Data selection can be biased by the presence of the source code on GitHub. We consider this acceptable considering that, in recent years, GitHub has been the best-known platform for the open-source community and it offers a large and diverse selection of OSS projects [6].
Searching candidate repositories using the GitHub API was not possible due to limitations on the number of results returned by each query. Indeed, even when stratifying search queries (e.g., by file size, with a byte-level granularity), not all the results could be retrieved. We overcame this issue by using BigQuery.
Resorting to a heuristic approach for matching Google Play listings to GitHub repositories entails the risk of mismatches. In particular, the 5.0% of apps that were linked by popularity measures might have been wrongly classified. However, confidence in correct matches is high for the 77.1% of apps for which only a unique repository contains an AndroidManifest.xml file.
3 RELATED WORK
Previous studies created data collections of OSS Android applications. For their study on app releases, Nayebi et al. [13] linked 69 F-Droid apps with version control repositories. Where available, metadata from Google Play was included. A similar dataset of OSS Android apps was constructed by Krutz et al. [9] to facilitate security research [10]. Das et al. [4] used F-Droid as a starting point for identifying open-source Android apps. They built a dataset for the analysis of performance-related commits of mobile applications by matching apps listed on F-Droid against GitHub repositories. Later, the apps were filtered considering their availability on Google Play. The final dataset was composed of 2,443 apps.
These datasets have the advantage that F-Droid contains executable app packages, which our collection does not include. However, AndroidTimeMachine covers more apps than listed on F-Droid because we identify candidate repositories by searching for the Android app manifest; this approach provides a more realistic sample of open-source Android apps and increases the number and diversity of apps to perform research on.
4 CONCLUSIONS
We created AndroidTimeMachine, a dataset of 8,431 real-world open-source Android apps. It combines source and commit history information available on GitHub with the metadata from the Google Play store. The graph representation used for structuring the data eases the analysis of the relationships between source code and metadata. The dataset is provided as a Docker container to improve its accessibility and extensibility.
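As an illustration of the repository-matching risk discussed in the limitations above, the following Python sketch distinguishes a unique match from a popularity-based tie-break. It is a simplified reading of the text, not the authors' pipeline: the function name `match_repository`, the `stars` field as the popularity measure, and the sample repositories are all assumptions made for the example.

```python
def match_repository(candidates):
    """Pick a repository for one Google Play listing.

    `candidates` is a list of dicts with keys 'name' and 'stars', assumed to be
    the repositories whose AndroidManifest.xml refers to the listing's package.
    Returns (repository name or None, how the match was decided).
    """
    if not candidates:
        return None, "none"
    if len(candidates) == 1:
        # Only one repository qualifies: the high-confidence case.
        return candidates[0]["name"], "unique"
    # Several repositories qualify: fall back to a popularity measure.
    # Star count is an assumption here; this is the error-prone case.
    best = max(candidates, key=lambda repo: repo["stars"])
    return best["name"], "popularity"


# Hypothetical candidates (not from the dataset).
unique_case = match_repository([{"name": "example/notes", "stars": 12}])
tie_break_case = match_repository([
    {"name": "fork/notes", "stars": 3},
    {"name": "example/notes", "stars": 120},
])
no_match_case = match_repository([])
```

Recording how each match was decided, as the second element of the returned tuple does, lets a study exclude or separately inspect the popularity-based links when a higher-confidence sample is needed.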