🚀 Playbook: How to scrape profile info from GitHub
This is a technical playbook I use to export hundreds of GitHub user profiles to a spreadsheet for easy analysis and targeting.
GitHub as a source of information
GitHub is the world’s largest social network for technical people, which makes it a great source of information for targeted outreach if your audience is technical.
But unlike other social networks, GitHub isn’t built for direct communication or chat. It’s designed to get people to look at projects and contribute code. GitHub users can also leave a “star” on projects they’re interested in. It’s equivalent to an “upvote” or a “like” on other platforms.
Analyzing clusters of users based on the projects they star or the contributions they make can be very helpful for dev-marketers. Check out this playbook, for example, to see how I’ve put this into action IRL.
The hard part is accessing and exporting this information from GitHub.
So I’m going to outline the steps and the tools I use to export this data from GitHub and get it into a usable CSV.
GitHub GraphQL API Explorer
The main tool I use for this is the GitHub GraphQL API Explorer. It allows me to export lists of Stargazers from specific GitHub projects, together with their GitHub profile information.
It’s not glamorous, but it’s the path of least resistance to extracting the info I need.
Here’s how I use it.
The initial query - first 100 results
Open the GitHub GraphQL API Explorer and log in with your GitHub account.
Paste the following query into the left pane of the Explorer to extract the relevant data from your desired GitHub project:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100){
      pageInfo{
        endCursor
        hasNextPage
      }
      nodes{
        company
        email
        login
        twitterUsername
        websiteUrl
        followers{
          totalCount
        }
        organizations(first: 10){
          nodes{
            name
          }
        }
      }
    }
  }
}
You’ll need to replace the “owner” and “name” values to match the repository whose stargazer list you want to extract.
Press the “play” button to run the query. The output is a list of 100 stargazers with the relevant profile information, displayed in JSON format on the right side of the Explorer.
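If you’d rather run the same query outside the Explorer, here’s a minimal Python sketch that uses only the standard library. It’s an illustration under a few assumptions, not the method this playbook relies on: it posts to GitHub’s GraphQL endpoint (https://api.github.com/graphql) with a personal access token, and the helper names build_payload and run_query are made up for this example. The $after variable is included so the same query can be reused for later pages.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Same fields as the Explorer query, with owner/name/after passed as variables.
QUERY = """
query ($owner: String!, $name: String!, $after: String) {
  repository(owner: $owner, name: $name) {
    stargazers(first: 100, after: $after) {
      pageInfo { endCursor hasNextPage }
      nodes {
        company
        email
        login
        twitterUsername
        websiteUrl
        followers { totalCount }
        organizations(first: 10) { nodes { name } }
      }
    }
  }
}
"""


def build_payload(owner, name, after=None):
    """Return the JSON body for one page of stargazers (hypothetical helper)."""
    return {
        "query": QUERY,
        "variables": {"owner": owner, "name": name, "after": after},
    }


def run_query(owner, name, token, after=None):
    """POST the query to GitHub and return the decoded JSON response.

    `token` is assumed to be a GitHub personal access token.
    """
    body = json.dumps(build_payload(owner, name, after)).encode("utf-8")
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (needs network access and a real token):
# result = run_query("[repo owner]", "[repo name]", token="<your token>")
# print(json.dumps(result, indent=2))
```

The response should have the same shape as what the Explorer shows in its right pane, so everything that follows applies to it as well.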
100 is the maximum number of results that the API allows in one batch. You can extract additional batches of 100 stargazers by running the query again and telling it to fetch the next set of 100 users. But we’ll get to that in a second. First, let’s understand what to do with this raw JSON.
Converting the JSON to CSV
Once you have the JSON staring you in the face, you need to convert it to something usable, like a CSV. I found a simple “JSON to CSV converter” that you can use for free.
Copy the JSON results from the GitHub API Explorer and paste them into the JSON converter. The output is a table of results that can be copied and pasted into a spreadsheet.
There are ways to further automate this part of the process, but you need to know a little coding to do that. So for now, we’ll stick to the basics.
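If you do know a little Python, one way to automate the conversion might look like the sketch below. It assumes the JSON has the shape the Explorer returns for the query above (data → repository → stargazers → nodes). The column list and the function names flatten_stargazer and stargazers_to_csv are my own choices for this example, and nested values (follower count, organization names) are flattened into single cells.

```python
import csv
import io

# Column order for the spreadsheet; chosen for this example.
COLUMNS = [
    "login", "email", "company", "twitterUsername",
    "websiteUrl", "followers", "organizations",
]


def flatten_stargazer(node):
    """Turn one stargazer node into a flat dict with one value per column."""
    org_nodes = (node.get("organizations") or {}).get("nodes") or []
    return {
        "login": node.get("login"),
        "email": node.get("email"),
        "company": node.get("company"),
        "twitterUsername": node.get("twitterUsername"),
        "websiteUrl": node.get("websiteUrl"),
        "followers": (node.get("followers") or {}).get("totalCount"),
        # Join organization names into one cell; skip missing names.
        "organizations": "; ".join(
            org["name"] for org in org_nodes if org and org.get("name")
        ),
    }


def stargazers_to_csv(response, include_header=True):
    """Convert one page of Explorer-style JSON into CSV text."""
    nodes = response["data"]["repository"]["stargazers"]["nodes"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    if include_header:
        writer.writeheader()
    for node in nodes:
        writer.writerow(flatten_stargazer(node))
    return buffer.getvalue()
```

Pass include_header=False for every page after the first, so that appending the output to an existing file doesn’t repeat the header row.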
With the first 100 results in your spreadsheet, you’re ready to get some more. To do this, you need to ask the GitHub Explorer to show you the next set of 100 results in the list.
Asking GitHub for the next 100 users in the list
To do so, go back to the same GitHub Explorer window and modify the top of the query that you’re using in the left panel. The top few lines currently look like this:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100){
      pageInfo{
        endCursor
You now want to modify those lines with two simple changes:
Add the “after” parameter
Make it point to the “endCursor” string
So now, the query on the left side looks like this:
{
  repository(owner: "[repo owner]", name: "[repo name]"){
    stargazers(first: 100, after: "[the endCursor string]"){
      pageInfo{
        endCursor
Now, copy the long “endCursor” string that appears at the top of the right panel, above the first 100 results. Paste that string between the quotes next to the “after” parameter in your query, replacing the placeholder text entirely.
This is basically asking the API to pull the next 100 results in the list.
Every page of 100 results that is displayed has a unique “endCursor” string that appears at the top of it. So to see the next 100 results in the list, simply copy the endCursor, paste it into the query, and run the query again with the “play” button.
Each time the new endCursor is applied to the query and the query is run again, the results on the right side will update to show the next 100 users in the list.
For each set of 100 users, simply use the JSON-to-CSV tool to convert it and append it to your growing spreadsheet of contacts.
Run this as many times as needed until you have a healthy number of contacts to work with.
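The copy, paste, and re-run loop described above can also be sketched in Python. This is a hedged illustration rather than a finished tool: collect_stargazers is a hypothetical name, and fetch_page stands for any function you supply that takes a cursor (None for the first page) and returns one page of JSON in the Explorer’s shape. The max_pages cap is my own addition so the loop stops after a known number of requests.

```python
def collect_stargazers(fetch_page, max_pages=10, start_cursor=None):
    """Follow endCursor from page to page and gather all stargazer nodes.

    fetch_page(cursor) must return the decoded JSON for one page.
    Returns (stargazers, last_cursor) so a later run can resume.
    """
    stargazers = []
    cursor = start_cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)["data"]["repository"]["stargazers"]
        stargazers.extend(page["nodes"])
        # The endCursor of this page is the "after" value for the next one.
        cursor = page["pageInfo"]["endCursor"]
        if not page["pageInfo"]["hasNextPage"]:
            break  # reached the end of the stargazer list
    return stargazers, cursor
```

Returning the last cursor alongside the results means you can stop at any point and carry on from the same place later.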
Pro tip: Before closing your API Explorer window and analyzing your spreadsheet of GitHub contacts, copy the next query that you’ll want to run and paste it somewhere for safekeeping. This will let you pick up where you left off in the stargazers list (which can be pretty long for a popular project).
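If you’re scripting the process, the same safekeeping idea can be handled with a small file. The sketch below is only one possible approach. The function names save_cursor and load_cursor, and the JSON layout of the file, are assumptions made for this example.

```python
import json
from pathlib import Path


def save_cursor(path, owner, name, cursor):
    """Store the last endCursor together with the repository it belongs to."""
    state = {"owner": owner, "name": name, "after": cursor}
    Path(path).write_text(json.dumps(state), encoding="utf-8")


def load_cursor(path):
    """Return the saved cursor, or None if nothing has been saved yet."""
    file = Path(path)
    if not file.exists():
        return None
    return json.loads(file.read_text(encoding="utf-8"))["after"]
```

A loaded value of None simply means “start from the first page”, which matches running the original query without the “after” parameter.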