Methodology

A detailed overview of our data collection, processing, and analysis methods

Data Collection

Historical Trip Data

We collected historical Citi Bike trip data covering January 2024 through October 2025 (22 months), sourced from Citi Bike System Data. The raw dataset comprises 96 CSV files containing detailed trip information; a single monthly release can span more than one file.

  • Total raw dataset size: ~15 GB
  • Total trips in raw data: ~84 million
  • Filtered dataset: 529,908 trips involving Columbia stations

Real-Time Station Status

Live station data is fetched from the Citi Bike GBFS (General Bikeshare Feed Specification) API, which reports real-time bike availability, dock availability, and station status.
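
As an illustration of how such a feed can be read, here is a minimal sketch using only the Python standard library. The field names (num_bikes_available, num_docks_available, is_renting) come from the GBFS station_status feed; the URL, the function names, and the sample payload are assumptions made for this example and do not describe the project's actual backend code.

```python
import json
from urllib.request import urlopen

# Assumed endpoint, for illustration only. The authoritative URL should be
# read from the system's gbfs.json auto-discovery file.
STATION_STATUS_URL = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"


def fetch_station_status(url=STATION_STATUS_URL, timeout=10):
    """Download a station_status feed and return the decoded JSON payload."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_stations(payload, station_ids=None):
    """Return one summary dict per station, optionally limited to station_ids."""
    wanted = set(station_ids) if station_ids is not None else None
    rows = []
    for station in payload["data"]["stations"]:
        if wanted is not None and station["station_id"] not in wanted:
            continue
        rows.append({
            "station_id": station["station_id"],
            "bikes_available": station.get("num_bikes_available", 0),
            "docks_available": station.get("num_docks_available", 0),
            # Older GBFS versions use 0/1 here, newer ones use booleans.
            "is_renting": bool(station.get("is_renting", 0)),
        })
    return rows


# Hand-written sample payload (hypothetical station IDs and counts).
SAMPLE_PAYLOAD = {
    "last_updated": 1700000000,
    "ttl": 60,
    "data": {
        "stations": [
            {"station_id": "example-a", "num_bikes_available": 5,
             "num_docks_available": 14, "is_renting": 1},
            {"station_id": "example-b", "num_bikes_available": 0,
             "num_docks_available": 27, "is_renting": 0},
        ]
    },
}

for row in summarize_stations(SAMPLE_PAYLOAD):
    print(row)
```

Note that the station_id values in a GBFS feed may not match the station IDs used in the trip data, so a lookup through the station_information feed may be needed before joining the two sources.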

Station Selection

We focused our analysis on 7 Citi Bike stations in the Columbia University area, covering the Morningside Heights and Manhattanville neighborhoods:

  • Broadway & W 122 St (Station ID: 7783.18)
  • Morningside Dr & Amsterdam Ave (Station ID: 7741.04)
  • W 120 St & Claremont Ave (Station ID: 7745.07)
  • Amsterdam Ave & W 119 St (Station ID: 7727.07)
  • W 116 St & Broadway (Station ID: 7713.11)
  • W 116 St & Amsterdam Ave (Station ID: 7692.11)
  • W 113 St & Broadway (Station ID: 7713.01)

These stations were selected for their proximity to the Columbia University campus and their relevance to student and staff commuting patterns.

Data Processing

Filtering Process

From the complete Citi Bike dataset, we filtered trips where either the origin (start_station_id) or destination (end_station_id) was one of our 7 Columbia-area stations. This resulted in 529,908 trips.

Data Schema

Each trip record contains 13 fields:

  • ride_id: Unique trip identifier
  • rideable_type: Bike type (classic_bike, electric_bike)
  • started_at, ended_at: Trip start and end timestamps
  • start_station_name, start_station_id: Origin station details
  • end_station_name, end_station_id: Destination station details
  • start_lat, start_lng, end_lat, end_lng: GPS coordinates
  • member_casual: User type (member or casual rider)

Data Pipeline

  1. Load all 96 CSV files using Python pandas
  2. Parse datetime columns (started_at, ended_at)
  3. Filter trips involving Columbia stations
  4. Sort by trip start time, then by end time
  5. Export to consolidated CSV file (columbia_filtered_citibike.csv, ~52 MB)
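
The five steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual script: the function names and the directory layout are hypothetical, and filtering each file before concatenation is an assumption made here to keep memory use low on a ~15 GB input. Station IDs are read as strings so that values such as 7713.01 are not altered by float parsing.

```python
from pathlib import Path

import pandas as pd

# The 7 Columbia-area station IDs listed under Station Selection.
COLUMBIA_STATION_IDS = {
    "7783.18", "7741.04", "7745.07", "7727.07",
    "7713.11", "7692.11", "7713.01",
}


def filter_columbia_trips(trips):
    """Keep trips that start or end at one of the Columbia-area stations."""
    mask = (
        trips["start_station_id"].isin(COLUMBIA_STATION_IDS)
        | trips["end_station_id"].isin(COLUMBIA_STATION_IDS)
    )
    return trips[mask]


def build_filtered_dataset(raw_dir, output_path):
    """Load every CSV in raw_dir, filter, sort, and export one combined CSV."""
    frames = []
    for csv_path in sorted(Path(raw_dir).glob("*.csv")):
        trips = pd.read_csv(
            csv_path,
            # Read IDs as text so they compare equal to the set above.
            dtype={"start_station_id": str, "end_station_id": str},
            parse_dates=["started_at", "ended_at"],
        )
        # Filtering per file keeps only a small slice of each file in memory.
        frames.append(filter_columbia_trips(trips))
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {raw_dir}")
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values(["started_at", "ended_at"])
    combined = combined.reset_index(drop=True)
    combined.to_csv(output_path, index=False)
    return combined
```

A call such as build_filtered_dataset("raw_csvs", "columbia_filtered_citibike.csv") would then produce the consolidated file, where "raw_csvs" is a placeholder directory name. The output path should sit outside the input directory so a later run does not read the result back in as input.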

Analysis Methods

Temporal Analysis

We analyze usage patterns across multiple time scales:

  • Seasonal trends: Monthly aggregation to identify seasonal variations
  • Weekly patterns: Day-of-week analysis to understand weekday vs. weekend usage
  • Hourly distribution: Hour-of-day analysis to identify peak usage times
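
The three aggregations above could be computed along these lines. This is a sketch under the assumption that trips are bucketed by their started_at timestamp; the function name temporal_profiles is hypothetical, and the three sample trips are made up for illustration.

```python
import pandas as pd


def temporal_profiles(trips):
    """Return trip counts per month, per day of week, and per hour of day."""
    started = pd.to_datetime(trips["started_at"])
    # One bucket per calendar month, in chronological order.
    monthly = started.dt.to_period("M").value_counts().sort_index()
    # Day of week: 0 = Monday ... 6 = Sunday; days with no trips get 0.
    weekday = started.dt.dayofweek.value_counts().reindex(range(7), fill_value=0)
    # Hour of day: 0-23; hours with no trips get 0.
    hourly = started.dt.hour.value_counts().reindex(range(24), fill_value=0)
    return monthly, weekday, hourly


# Made-up sample: one Monday trip and two Saturday trips.
sample = pd.DataFrame({
    "started_at": [
        "2024-01-01 08:15:00",
        "2024-01-06 08:45:00",
        "2024-02-03 17:05:00",
    ]
})
monthly, weekday, hourly = temporal_profiles(sample)
print(monthly)
print(weekday)
```

Reindexing the weekday and hourly counts to their full ranges keeps empty buckets visible as zeros, which matters when the results feed a heatmap or bar chart with a fixed axis.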

Station-Level Metrics

Key metrics calculated for each station:

  • Total trips: Count of trips originating or ending at the station
  • Inflow/Outflow balance: Net difference between arriving and departing bikes
  • Bike type distribution: Breakdown of classic vs. electric bike usage
  • User type distribution: Member vs. casual rider proportions
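
A minimal sketch of these four metrics for a single station is shown below. Two choices here are assumptions rather than the project's documented behavior: the balance is reported as arrivals minus departures (so a positive value means bikes accumulate at the station), and a round trip that starts and ends at the same station counts once toward total trips. The function name station_metrics and the sample rows are hypothetical.

```python
import pandas as pd


def station_metrics(trips, station_id):
    """Compute trip count, net inflow, and usage shares for one station."""
    is_departure = trips["start_station_id"] == station_id
    is_arrival = trips["end_station_id"] == station_id
    # Trips touching the station; a round trip appears only once here.
    touching = trips[is_departure | is_arrival]
    return {
        "total_trips": len(touching),
        # Assumed sign convention: arrivals minus departures.
        "net_inflow": int(is_arrival.sum()) - int(is_departure.sum()),
        "bike_type_share": touching["rideable_type"]
        .value_counts(normalize=True).to_dict(),
        "user_type_share": touching["member_casual"]
        .value_counts(normalize=True).to_dict(),
    }


# Made-up sample trips; "5000.01" and "5000.02" are placeholder station IDs.
sample_trips = pd.DataFrame({
    "start_station_id": ["7713.11", "5000.01", "5000.02", "7713.11", "5000.01"],
    "end_station_id": ["5000.01", "7713.11", "7713.11", "7713.11", "5000.02"],
    "rideable_type": ["classic_bike", "electric_bike", "electric_bike",
                      "classic_bike", "classic_bike"],
    "member_casual": ["member", "member", "casual", "member", "casual"],
})
metrics = station_metrics(sample_trips, "7713.11")
print(metrics)
```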

Visualization Tools

We use interactive visualizations built with Plotly and deck.gl to present findings:

  • Time series charts: Track usage trends over time
  • Heatmaps: Display hour-by-day usage patterns
  • Bar charts: Compare metrics across stations
  • 3D map visualization: Real-time station status with geographic context

Technologies & Tools

Data Processing

  • Python 3.x
  • pandas (data manipulation)
  • Jupyter notebooks (exploratory analysis)

Visualization

  • Plotly (interactive charts)
  • deck.gl (3D map visualization)
  • MapLibre (map rendering)

Web Application

  • Next.js 14 (React framework)
  • TypeScript
  • TailwindCSS (styling)

Backend

  • FastAPI (Python web framework)
  • RESTful API design
  • GBFS API integration