Playbook: How to scrape profile info from GitHub
This is a technical playbook I use to export hundreds of GitHub user profiles to a spreadsheet for easy analysis and targeting.
GitHub as a source of information
GitHub is the world's largest social network for technical people, which makes it a great source of information for targeted activities if you're trying to reach a technical audience.
But unlike other social networks, GitHub isn't built for direct communication or chat. It's designed to get people to look at projects and contribute code. GitHub users can also leave a "star" on projects they're interested in. It's equivalent to an "upvote" or a "like" on other platforms.
Analyzing clusters of users based on the projects they star or the contributions they make can be very helpful for dev-marketers. Check out this playbook, for example, to see how I've put this into action IRL.
The hard part is accessing and exporting this information from GitHub.
So I'm going to outline the steps and the tools I use to export this data from GitHub and get it into a usable CSV.
GitHub GraphQL API Explorer
The main tool I use for this is the GitHub GraphQL API Explorer. It allows me to export lists of Stargazers from specific GitHub projects, together with their GitHub profile information.
It's not glamorous, but it's the path of least resistance to extracting the info I need.
Here's how I use it.
The initial query - first 100 results
Open the GitHub GraphQL API Explorer and log in with your GitHub account.
Paste the following query into the left pane of the Explorer to extract the relevant data from your desired GitHub project:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100){
      pageInfo{
        endCursor
        hasNextPage
      }
      nodes{
        company
        email
        login
        twitterUsername
        websiteUrl
        followers{
          totalCount
        }
        organizations(first:10){
          nodes{
            name
          }
        }
      }
    }
  }
}
You'll need to replace the "owner" and "name" values to match the specific repository whose stargazer list you want to extract.
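If you'd rather not hand-edit the placeholders every time, a few lines of Python can fill them in for you. This is only a sketch: `build_query` and `QUERY_TEMPLATE` are names I made up for illustration, the field list is trimmed to keep it short, and `octocat` / `Hello-World` are just example values. The output is plain query text that you can paste into the Explorer.

```python
import json

# Query text with two slots. Doubled braces are literal braces for str.format.
QUERY_TEMPLATE = """{{
  repository(owner: {owner}, name: {name}) {{
    stargazers(first: 100) {{
      pageInfo {{ endCursor hasNextPage }}
      nodes {{ login company email twitterUsername websiteUrl }}
    }}
  }}
}}"""


def build_query(owner: str, name: str) -> str:
    """Return the stargazers query for one repository (hypothetical helper)."""
    # json.dumps adds the double quotes and escapes any awkward characters.
    return QUERY_TEMPLATE.format(owner=json.dumps(owner), name=json.dumps(name))


# Example values only; swap in the repository you are targeting.
print(build_query("octocat", "Hello-World"))
```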
Press the "play" button to run the query. The output is a list of 100 stargazers with the relevant profile information, displayed in JSON format on the right side of the Explorer.
100 is the maximum number of results that the API allows in one batch. You can extract additional batches of 100 stargazers from the list by simply running the query again and telling it to fetch the next set of 100 users. But we'll get to that in a second. First, let's look at what to do with this raw JSON.
Converting the JSON to CSV
Once you have the JSON staring you in the face, you need to convert it to something usable, like a CSV. I found a simple "JSON to CSV converter" that you can use for free.
Copy the JSON results from the GitHub API Explorer and paste them into the JSON converter. The output is a table of results that can be copied and pasted into a spreadsheet.
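For anyone comfortable with a little Python, here is a rough sketch of the same conversion done locally. It assumes the JSON has the shape the query above produces (`data` > `repository` > `stargazers` > `nodes`). The function names, the column order, and the sample user are all made up for illustration, so treat it as a starting point rather than a finished tool.

```python
import csv
import io
import json

COLUMNS = ["login", "company", "email", "twitterUsername",
           "websiteUrl", "followers", "organizations"]


def stargazers_to_rows(response: dict) -> list:
    """Flatten one page of Explorer output into one flat dict per user."""
    nodes = response["data"]["repository"]["stargazers"]["nodes"]
    rows = []
    for user in nodes:
        org_nodes = (user.get("organizations") or {}).get("nodes") or []
        orgs = [org["name"] for org in org_nodes if org and org.get("name")]
        rows.append({
            "login": user.get("login"),
            "company": user.get("company"),
            "email": user.get("email"),
            "twitterUsername": user.get("twitterUsername"),
            "websiteUrl": user.get("websiteUrl"),
            # Nested objects are collapsed into a single spreadsheet cell.
            "followers": (user.get("followers") or {}).get("totalCount"),
            "organizations": "; ".join(orgs),
        })
    return rows


def rows_to_csv(rows: list) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the same shape as the Explorer's right-hand pane.
sample = json.loads("""
{"data": {"repository": {"stargazers": {
  "pageInfo": {"endCursor": "EXAMPLE_CURSOR=", "hasNextPage": true},
  "nodes": [
    {"company": "Example Co", "email": "", "login": "example-user",
     "twitterUsername": null, "websiteUrl": "https://example.com",
     "followers": {"totalCount": 12},
     "organizations": {"nodes": [{"name": "Example Org"}]}}
  ]}}}}
""")
print(rows_to_csv(stargazers_to_rows(sample)))
```

Missing values (a null Twitter handle, an empty email) simply come out as empty cells, which is usually what you want in a spreadsheet.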
There are ways to further automate this part of the process, but you need to know a little coding to do that. So for now, we'll stick to the basics.
With the first 100 results in your spreadsheet, you're ready to get some more. To do this, you need to ask the GitHub Explorer to show you the next set of 100 results in the list.
Asking GitHub for the next 100 users in the list
To do so, go back to the same GitHub Explorer window and modify the top of the query that you're using in the left panel. The top few lines currently look like this:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100){
      pageInfo{
        endCursor
You now want to modify those lines with two simple changes:
Add the "after" parameter
Make it point to the "endCursor" string
So now, the query on the left side looks like this:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100, after: "[the endCursor string]"){
      pageInfo{
        endCursor
Now, copy the long "endCursor" string that appears at the top of the right panel, above the first 100 results. Paste that string next to the "after" parameter in your query, replacing the "the endCursor string" placeholder text above.
This is basically asking the API to pull the next 100 results in the list.
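If you keep the query text in a script, the same edit can be done with a small helper. This is a sketch under my own assumptions: `with_cursor` is a hypothetical name, it only looks for a `stargazers(first: 100 ...)` argument list like the one used here, and the cursor in the example is a made-up value.

```python
import json
import re

# Matches the stargazers argument list with or without an existing cursor.
STARGAZERS_ARGS = re.compile(r"stargazers\(first:\s*100[^)]*\)")


def with_cursor(query: str, end_cursor: str) -> str:
    """Return the query with `after` pointing at the given endCursor."""
    next_page = "stargazers(first: 100, after: " + json.dumps(end_cursor) + ")"
    # A function replacement avoids any escape handling of the cursor text.
    updated, count = STARGAZERS_ARGS.subn(lambda match: next_page, query, count=1)
    if count == 0:
        raise ValueError("no stargazers(first: 100 ...) field found in query")
    return updated


first_page = 'stargazers(first: 100){'
print(with_cursor(first_page, "EXAMPLE_CURSOR="))
```

Because the pattern also matches a query that already has a cursor, you can feed each new endCursor into the previous query instead of starting again from the first-page version.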
Every page of 100 results that is displayed has a unique "endCursor" string that appears at the top of it. So to see the next 100 results in the list, simply copy the endCursor, paste it into the query, and run the query again with the "play" button.
Here's what it looks like IRL:
Each time the new endCursor is applied to the query and the query is run again, the results on the right side will update to show the next 100 users in the list.
For each set of 100 users, simply use the JSON-to-CSV tool to convert it and append it to your growing spreadsheet of contacts.
Run this as many times as needed until you have a healthy number of contacts to work with.
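The copy, paste, and run cycle above is a loop, so it can be scripted. The sketch below goes a step beyond the Explorer and is not the flow I described: it assumes you call GitHub's GraphQL endpoint (`https://api.github.com/graphql`) directly with a personal access token, and it passes the cursor as a GraphQL variable instead of pasting it into the query text. The function names, the `max_pages` safety limit, and the trimmed field list are my own choices for illustration.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Same idea as the Explorer query, with the changing parts as variables.
QUERY = """
query($owner: String!, $name: String!, $after: String) {
  repository(owner: $owner, name: $name) {
    stargazers(first: 100, after: $after) {
      pageInfo { endCursor hasNextPage }
      nodes { login company email twitterUsername websiteUrl }
    }
  }
}
"""


def post_graphql(token: str, variables: dict) -> dict:
    """Send QUERY to GitHub and return the parsed JSON (needs a valid token)."""
    body = json.dumps({"query": QUERY, "variables": variables}).encode("utf-8")
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={"Authorization": "bearer " + token,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def all_stargazers(run_page, max_pages: int = 50) -> list:
    """Follow endCursor from page to page and collect every user node.

    run_page(after) must return the parsed JSON for one page of results.
    """
    users, after = [], None
    for _ in range(max_pages):
        page = run_page(after)["data"]["repository"]["stargazers"]
        users.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            break
        after = page["pageInfo"]["endCursor"]
    return users


# Real use (assumes a token in the GITHUB_TOKEN environment variable):
# import os
# users = all_stargazers(lambda after: post_graphql(
#     os.environ["GITHUB_TOKEN"],
#     {"owner": "[repo owner]", "name": "[repo name]", "after": after}))
```

Keeping the page-following loop separate from the network call means you can test it with canned pages before pointing it at a popular repository.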
Pro tip: Before closing your API Explorer window and analyzing your spreadsheet of GitHub contacts, copy the next query that you'll want to run and paste it somewhere for safekeeping. This will let you pick up where you left off in the stargazers list (which can be pretty long for a popular project).
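If you are working from a script instead of the Explorer, the equivalent of that safekeeping step is to write the last endCursor to a small file. A minimal sketch, assuming a JSON bookmark file; the file name, the function names, and the cursor value are all hypothetical.

```python
import json
import os
import tempfile


def save_cursor(path: str, end_cursor: str) -> None:
    """Remember where the export stopped."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"endCursor": end_cursor}, handle)


def load_cursor(path: str):
    """Return the saved cursor, or None if nothing has been saved yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["endCursor"]


# Demo in a throwaway folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    bookmark = os.path.join(folder, "stargazers_cursor.json")
    print(load_cursor(bookmark))            # nothing saved yet
    save_cursor(bookmark, "EXAMPLE_CURSOR=")
    print(load_cursor(bookmark))            # the value to use for "after"
```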