What Do Travellers Really Think About Swiss Train Stations?

An analysis of Google Maps reviews for SBB stations

Introduction

A train station serves many functions at once: transit node, waiting space, and first point of contact with a city. This analysis examines how passengers experience SBB stations in Switzerland across all of these dimensions, drawing on real reviews to look beyond the network’s headline performance metrics.

© SBB CFF FFS

This analysis covers 22,000+ Google Maps reviews across 61 SBB stations, exploring ratings, sentiment, language patterns, and what passengers consistently praise or complain about.

Scope and Methodology

Stations were selected using SBB’s trafimage dataset as a filter. I assumed that if a station appears on the official schematic map, it has enough operational significance and foot traffic to generate useful review volume. Smaller stations were excluded as they produce too few reviews on Google Maps to say anything meaningful. Both Bern Europaplatz stations were merged into one entry for crawling.

Not every review for every station was captured. Reviews longer than 340 characters are cut off. The dataset is enough to show what this kind of analysis can reveal, not to make definitive claims about the network or the stations as a whole.
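
Since truncated reviews may lose their conclusion, it can be useful to flag them before any text analysis. The following is a minimal sketch, assuming the 340-character cut-off acts as a hard limit on the stored text. The constant and the function name are illustrative and not part of the original pipeline.

```python
TRUNCATION_LIMIT = 340  # assumed hard cut-off, taken from the limitation described above


def is_probably_truncated(text, limit=TRUNCATION_LIMIT):
    """Return True when a review is long enough that it was likely cut off."""
    if not isinstance(text, str):
        return False  # rating-only reviews carry no text
    return len(text) >= limit


# Hypothetical examples: a short review, a review at the limit, a rating-only review
sample_reviews = ["Clean and well signposted.", "x" * 340, None]
flags = [is_probably_truncated(r) for r in sample_reviews]
print(flags)
```

This is only a heuristic: a review that happens to be exactly at the limit would be flagged as well.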

Where This Could Go

This project is a proof of concept. The same analysis can be done on the complete set of reviews and mentions from Google Maps, social media, and travel forums in near-real-time. For example, spikes in negative sentiment around a specific station might show up days before formal complaints do, which could be used to develop an early warning system for issues at stations. The same approach could extend beyond stations to other SBB facilities like ticket offices, parking areas, or bike rental points, anywhere public reviews accumulate and operational decisions depend on customer experience.
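
To make the early-warning idea more concrete, here is a rough sketch of how a spike in negative sentiment could be detected. It assumes a weekly share of 1-2 star reviews per station is already available and compares each week with the preceding weeks using a rolling z-score. The function name, window size, threshold, and sample values are all hypothetical.

```python
from statistics import mean, stdev


def negative_share_spikes(weekly_share, window=4, z_threshold=2.0):
    """Return indices of weeks whose negative-review share is unusually high.

    Each week is compared against the mean and standard deviation of the
    preceding `window` weeks (a simple rolling z-score).
    """
    spikes = []
    for i in range(window, len(weekly_share)):
        history = weekly_share[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: z-score undefined, skip this week
        if (weekly_share[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes


# Hypothetical share of 1-2 star reviews per week for one station
weekly_share = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.35, 0.12]
print(negative_share_spikes(weekly_share))
```

A production system would need more robust statistics and a minimum review count per week, but the principle stays the same: compare the current period against the station's own recent baseline.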

Note: TL;DR

I scraped and analysed 22,000+ Google Maps reviews across 61 SBB stations and ran each one through both a keyword filter and an AI classifier (GPT-4o-mini) that picks out which aspect of the station is being discussed and whether the sentiment is positive or negative. A few things stood out:

  • The network of stations averages 4.16 stars, which is high. The negative reviews are not spread evenly, though: they pile up at a small group of stations.
  • Reviewers are harshest on stations in their own linguistic region. Tourists writing in English are by far the most generous.
  • Complaints split into two useful groups. Damage control (Safety, Crowds, Toilets) is where investment mostly just stops the bleeding, because even the top rated stations don’t get praised for these. Visible wins (Cleanliness, Connections, Staff/Service, Food & Shops, Signage/Nav) is where the best stations actually get praise, so investment can lift sentiment as well as reduce complaints.
  • Safety, Staff/Service, and Crowds drive the biggest share of negative ratings by volume. Reviews that mention Staff/Service, Connections, or Safety also tend to come with the lowest ratings (1.6-2.0★ averages).

The final deliverable is a ranked list of action items per station, split by the categories Damage control and Visible wins.

Setup

Show code
import warnings
warnings.filterwarnings('ignore')

import pathlib
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import plotly.io as pio
from sklearn.feature_extraction.text import CountVectorizer
import pycountry
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
pio.templates.default = 'plotly_white'
pio.renderers.default = 'plotly_mimetype+notebook_connected'

# Shared color palettes used across all charts for consistency
RATING_SCALE = ['#d73027', '#f46d43', '#fdae61', '#a6d96a', '#1a9641']
LANG_COLORS  = {'de': '#2196F3', 'fr': '#E91E63', 'it': '#4CAF50', 'en': '#FF9800', 'rm': '#9C27B0', 'other': '#9E9E9E'}
ASPECT_COLORS = {
    'Safety': '#d32f2f', 'Crowds': '#e57373', 'Toilets': '#ef6c00',
    'Cleanliness': '#1976d2', 'Connections': '#0288d1', 'Food & Shops': '#388e3c',
    'Staff/Service': '#7b1fa2', 'Signage/Nav': '#ab47bc', 'Seating/Waiting': '#5d4037',
    'Accessibility': '#00796b', 'Lifts/Escalators': '#00838f', 'Parking/Bikes': '#616161',
}

# Shared thresholds: used across all reliability filters
MIN_REVIEWS_PER_STATION = 30   # excludes stations with too few reviews for stable averages
MIN_COMPLAINTS_PER_ASPECT = 10  # excludes aspects with too few mentions to be meaningful
Show code
DATA_DIR = '../data/raw'

stations = pd.read_csv(f'{DATA_DIR}/stations.csv')
reviews  = pd.read_csv(f'{DATA_DIR}/reviews.csv')

# Parse dates
reviews['date_estimated'] = pd.to_datetime(reviews['date_estimated'], errors='coerce')
reviews = reviews[reviews['date_estimated'].notna()].copy()
reviews['year_month'] = reviews['date_estimated'].dt.to_period('M')
reviews['year']       = reviews['date_estimated'].dt.year

# Normalise language codes
reviews['lang'] = reviews['language'].where(reviews['language'].isin(['de','fr','it','en','rm']), other='other')

# Only stations that were successfully scraped
done = stations[stations['scrape_status'] == 'done'].copy()

# Attach scraped review counts to done stations
scraped_counts = reviews.groupby('opuic').size().rename('scraped_reviews')
done = done.join(scraped_counts, on='opuic')
done['scraped_reviews'] = done['scraped_reviews'].fillna(0).astype(int)

print(f'Stations scraped : {len(done)}')
print(f'Reviews collected: {len(reviews):,}')
print(f'Date range       : {reviews["date_estimated"].min().date()} → {reviews["date_estimated"].max().date()}')
Stations scraped : 61
Reviews collected: 22,621
Date range       : 2011-05-12 → 2026-05-08

1: Where Are the Stations?

Before diving into the numbers, let’s put the stations on a map. The interactive map below shows every scraped station. Colour encodes the overall Google rating (red = low, green = high), and size reflects the total number of reviews. Hover over any dot for details.

Show code
fig = px.scatter_mapbox(
    done.dropna(subset=['latitude','longitude','overall_rating']),
    lat='latitude', lon='longitude',
    color='overall_rating',
    size='review_count_google',
    size_max=30,
    hover_name='name',
    hover_data={
        'overall_rating': ':.1f',
        'review_count_google': True,
        'scraped_reviews': True,
        'latitude': False,
        'longitude': False,
    },
    color_continuous_scale=RATING_SCALE,
    range_color=[1.0, 5.0],
    zoom=5.5,
    center={'lat': 46.8, 'lon': 8.2},
    mapbox_style='carto-positron',
    title='SBB Station Ratings across Switzerland',
    height=600,
    labels={'overall_rating': 'Rating', 'review_count_google': 'Total reviews', 'scraped_reviews': 'Collected reviews'},
)
fig.update_layout(coloraxis_colorbar_title='Rating')
fig.show()
Figure 1: SBB stations across Switzerland. Colour = overall Google rating, size = total review count.
Tip: Map at a glance

Most of the 61 stations cluster in the 4.0-4.5 star range. The negative tail is small and affects only a few stations, such as Yverdon-les-Bains, Genève-Aéroport, Lenzburg, and Olten.

2: The Dataset

Now that the geography is on the table, let’s look at what the dataset itself contains: how many reviews there are, in which languages, and how many carry text rather than just a star rating.

Show code
total_google = done['review_count_google'].sum()
total_scraped = done['scraped_reviews'].sum()
coverage = total_scraped / total_google * 100

print(f'Total reviews on Google : {total_google:,}')
print(f'Reviews collected       : {total_scraped:,}  ({coverage:.0f}% coverage)')
print(f'Reviews with text       : {reviews["text"].notna().sum():,}  ({reviews["text"].notna().mean()*100:.0f}%)')
print()
print('Review languages:')
print(reviews['lang'].value_counts().to_string())
Total reviews on Google : 37,154
Reviews collected       : 22,596  (61% coverage)
Reviews with text       : 14,783  (65%)

Review languages:
lang
other    11057
de        5549
en        3711
fr        1727
it         577
Show code
# Star-only vs written reviews
n_text   = reviews['text'].notna().sum()
n_notext = len(reviews) - n_text

fig = go.Figure(go.Pie(
    labels=['Rating + written review', 'Rating-only'],
    values=[n_text, n_notext],
    hole=0.4,
    marker_colors=['#2196F3', '#9E9E9E'],
    textinfo='percent+label',
    textposition='outside',
))
fig.update_layout(title=f'Review type: {len(reviews):,} total reviews')
fig.show()
print(f'Rating + written review: {n_text:,}  ({n_text/len(reviews)*100:.0f}%)')
print(f'Rating-only: {n_notext:,}  ({n_notext/len(reviews)*100:.0f}%)')
Figure 2: Share of scraped reviews that include written text alongside the rating vs a star-only rating. About 65% of the 22,621 scraped reviews carry text.
Rating + written review: 14,783  (65%)
Rating-only: 7,838  (35%)
Show code
written = reviews[reviews['text'].notna()].copy()

lang_counts = written['lang'].value_counts().reset_index()
lang_counts.columns = ['language', 'count']
lang_labels = {'de': 'German', 'fr': 'French', 'it': 'Italian', 'en': 'English', 'rm': 'Romansh', 'other': 'Other'}
lang_counts['label'] = lang_counts['language'].map(lang_labels)

fig = px.pie(
    lang_counts, values='count', names='label',
    color='language', color_discrete_map=LANG_COLORS,
    title=f'Largest Group of Review Languages: {len(written):,} written reviews',
    hole=0.4,
)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()
Figure 3: Language breakdown across the 14,783 written reviews. German dominates, followed by English, French, and Italian. ‘Other’ covers languages outside Switzerland’s four national languages and English.
Show code
# Full language breakdown
_ZH = {'zh-tw': 'Chinese (Trad.)', 'zh-cn': 'Chinese (Simp.)'}

def lang_name(code):
    if code in _ZH:
        return _ZH[code]
    lang = pycountry.languages.get(alpha_2=code)
    return lang.name if lang else code

lang_counts = written['language'].value_counts(dropna=False).reset_index()
lang_counts.columns = ['language', 'count']
lang_counts['label'] = lang_counts['language'].apply(
    lambda c: 'Undetected' if pd.isna(c) else lang_name(c)
)
lang_counts['pct'] = (lang_counts['count'] / len(written) * 100).round(1)

print(f'All languages in written reviews ({len(written):,} total):')
print(lang_counts[['label', 'count', 'pct']].to_string(index=False))
All languages in written reviews (14,783 total):
                  label  count  pct
                 German   5549 37.5
                English   3711 25.1
                 French   1727 11.7
             Undetected   1075  7.3
                Italian    577  3.9
                Spanish    407  2.8
                 Korean    224  1.5
             Portuguese    196  1.3
               Romanian    116  0.8
                Russian    115  0.8
               Japanese    109  0.7
                 Arabic     91  0.6
                Turkish     84  0.6
                  Dutch     77  0.5
                Catalan     58  0.4
              Afrikaans     46  0.3
        Chinese (Trad.)     44  0.3
                 Danish     42  0.3
                 Polish     41  0.3
                   Thai     40  0.3
              Ukrainian     36  0.2
        Chinese (Simp.)     36  0.2
             Indonesian     35  0.2
              Hungarian     33  0.2
              Norwegian     32  0.2
                Swedish     31  0.2
                  Czech     30  0.2
                Finnish     24  0.2
   Modern Greek (1453-)     23  0.2
                Tagalog     21  0.1
                 Hebrew     20  0.1
               Croatian     19  0.1
               Estonian     18  0.1
                 Slovak     16  0.1
                 Somali     14  0.1
             Vietnamese     13  0.1
                  Welsh     10  0.1
              Bulgarian      9  0.1
              Slovenian      9  0.1
                Persian      5  0.0
             Lithuanian      4  0.0
               Albanian      4  0.0
             Macedonian      3  0.0
                Latvian      3  0.0
                  Tamil      3  0.0
Swahili (macrolanguage)      2  0.0
                   Urdu      1  0.0

Romansh, Switzerland’s fourth national language, is absent from the dataset: no reviews in Romansh (rm) were detected. Romansh speakers number around 40,000 and are concentrated in rural Graubünden valleys, none of which are served by stations in this dataset. Romansh speakers are also typically bilingual in German, making German the likely choice when writing a review.

References: https://www.rtr.ch/emissiuns/decodar-nossa-cultura/raetoromanisch/fakten-geschichte/fakten-und-zahlen-raetoromanische-sprache https://www.bfs.admin.ch/asset/de/23366958

Tip: About the dataset

Of 22,000+ reviews, about 65% carry text rather than just stars. Among the written reviews, German leads (37.5%), followed by English (25.1%), French (11.7%), and Italian (3.9%), with Romansh entirely absent.

3: Ratings Across Stations

How are ratings distributed, which stations sit at the top and bottom, and which are the most polarising?

Show code
fig = px.histogram(
    done, x='overall_rating',
    nbins=20,
    title='Distribution of Station Ratings (Google aggregate)',
    labels={'overall_rating': 'Overall Rating', 'count': 'Number of Stations'},
    color_discrete_sequence=['#2196F3'],
)
fig.add_vline(
    x=done['overall_rating'].mean(), line_dash='dash', line_color='#E91E63',
    annotation_text=f" Mean: {done['overall_rating'].mean():.2f}",
    annotation_position='top right',
)
fig.update_layout(bargap=0.05)
fig.show()

print(f"Mean rating : {done['overall_rating'].mean():.2f}")
print(f"Median      : {done['overall_rating'].median():.2f}")
print(f"Std dev     : {done['overall_rating'].std():.2f}")
Figure 4: Distribution of overall station ratings.
Mean rating : 4.16
Median      : 4.20
Std dev     : 0.28

Overall, the analysed stations have a high mean rating of 4.16, which points to broad satisfaction with the station facilities.

Show code
top10 = done.nlargest(10,  'overall_rating')[['name','overall_rating','review_count_google']]
bot10 = done.nsmallest(10, 'overall_rating')[['name','overall_rating','review_count_google']]

fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['Top 10 Highest Rated', 'Bottom 10 Lowest Rated'],
                    horizontal_spacing=0.20)

fig.add_trace(go.Bar(
    x=top10['overall_rating'], y=top10['name'],
    orientation='h', marker_color='#1a9641',
    text=top10['overall_rating'].round(1), textposition='outside',
    name='Top 10',
), row=1, col=1)

fig.add_trace(go.Bar(
    x=bot10['overall_rating'], y=bot10['name'],
    orientation='h', marker_color='#d73027',
    text=bot10['overall_rating'].round(1), textposition='outside',
    name='Bottom 10',
), row=1, col=2)

fig.update_xaxes(range=[0, 5.5])
fig.update_yaxes(side='right', row=1, col=2)
fig.update_layout(height=400, showlegend=False, title_text='Best and Worst Rated SBB Stations')
fig.show()
Figure 5: The 10 best- and 10 worst-rated stations.

Before jumping into the controversial cases, here is a quick leaderboard. I sort every station by its overall Google rating and compare it to the scraped average to spot-check that the data lines up. The table also shows the number of reviews scraped and the total number of reviews available on Google.

Show code
# Per-station metrics
station_stats = reviews.groupby('opuic').agg(
    avg_rating    = ('rating', 'mean'),
    pct_with_text = ('text', lambda x: x.notna().mean() * 100),
    review_count  = ('id', 'count'),
).reset_index()

leaderboard = (
    done[['opuic','name','overall_rating','review_count_google']]
    .merge(station_stats, on='opuic', how='left')
    .sort_values('overall_rating', ascending=False)
    .reset_index(drop=True)
)
leaderboard.index += 1  # 1-based rank

leaderboard_display = leaderboard[[
    'name', 'overall_rating', 'avg_rating',
    'review_count_google', 'review_count', 'pct_with_text',
]].copy()
leaderboard_display.columns = [
    'Station', 'Google Rating', 'Avg Scraped Rating',
    'Reviews on Google', 'Reviews Scraped', '% with Text',
]
leaderboard_display = leaderboard_display.round(2)

print(f"Total reviews scraped across all stations: {int(leaderboard['review_count'].sum()):,}")
print()

with pd.option_context('display.max_rows', None):
    display(leaderboard_display)
Total reviews scraped across all stations: 22,596
 #  Station             Google Rating  Avg Scraped Rating  Reviews on Google  Reviews Scraped  % with Text
 1  Rorschach Hafen               4.7                4.68                103              103        42.72
 2  Rorschach Stadt               4.6                4.59                 17               17        35.29
 3  Genève-Eaux-Vives             4.6                4.56                 39               39        51.28
 4  Rapperswil SG                 4.6                4.57                212              212        44.34
 5  Thun                          4.5                4.51                378              359        51.53
 6  Rorschach                     4.5                4.49                 76               76        34.21
 7  Locarno                       4.5                4.46                135              135        48.89
 8  Montreux                      4.5                4.52                346              338        51.48
 9  Nyon                          4.5                4.52                243              243        42.39
10  Bern                          4.5                4.46               1348              816        68.50
11  Lugano                        4.4                4.37                871              672        64.14
12  Arth-Goldau                   4.4                4.37                362              340        41.47
13  Brig                          4.4                4.37                198              198        42.93
14  Zürich HB                     4.4                4.33               5250             2223        99.73
15  Luzern                        4.4                4.40               7155             2243        97.82
16  Chur                          4.3                4.32                495              465        55.91
17  Schaffhausen                  4.3                4.33                270              267        53.93
18  St. Gallen                    4.3                4.27                461              414        50.00
19  Zug                           4.3                4.31                406              376        43.88
20  Opfikon                       4.3                4.30                 23               23        47.83
21  Bellinzona                    4.3                4.28                376              341        40.76
22  Martigny                      4.3                4.26                120              120        40.00
23  Kreuzlingen Hafen             4.3                4.32                 47               47        51.06
24  Neuchâtel                     4.3                4.28                224              223        42.15
25  Zürich Enge                   4.2                4.19                134              134        42.54
26  Zürich Stadelhofen            4.2                4.16                691              541        57.86
27  Winterthur                    4.2                4.19                650              496        54.03
28  Sargans                       4.2                4.25                153              153        54.90
29  Delémont                      4.2                4.25                 73               73        45.21
30  Burgdorf                      4.2                4.20                 71               71        40.85
31  Zürich Oerlikon               4.2                4.11               2457             1150        78.52
32  Genève                        4.2                4.18               1046              750        68.40
33  Baden                         4.1                4.16               1439              643        62.83
34  Landquart                     4.1                4.05                 79               79        39.24
35  Kreuzlingen                   4.1                4.05                 80               80        51.25
36  Basel SBB                     4.1                3.97               2171             1323        79.82
37  Rotkreuz                      4.1                4.14                 99               99        36.36
38  Frauenfeld                    4.1                4.13                115              115        40.87
39  Bülach                        4.1                4.10                 78               78        42.31
40  Fribourg/Freiburg             4.1                4.10                496              406        43.84
41  La Chaux-de-Fonds             4.1                4.13                136              136        50.00
42  Vevey                         4.1                4.14                183              183        46.45
43  Sion                          4.1                4.05                221              221        42.08
44  Biel/Bienne                   4.1                4.06                568              452        50.66
45  Visp                          4.1                4.14                256              256        53.12
46  Lausanne                      4.0                3.93               2069             1008        74.40
47  Solothurn                     3.9                3.89                686              500        53.60
48  Uster                         3.9                3.89                394              362        40.88
49  Zürich Hardbrücke             3.9                3.92                280              277        41.88
50  Brugg AG                      3.9                3.87                178              178        46.63
51  Oensingen                     3.9                3.90                 41               41        36.59
52  Glattbrugg                    3.8                3.79                 48               48        33.33
53  Bern Europaplatz              3.8                3.80                 25               25        36.00
54  Aarau                         3.8                3.84                366              358        51.12
55  Wil SG                        3.8                3.81                187              187        44.92
56  Bern Wankdorf                 3.8                3.83                 60               60        53.33
57  Olten                         3.8                3.76                554              459        50.11
58  Langenthal                    3.7                3.66                 94               94        44.68
59  Lenzburg                      3.7                3.66                301              288        48.61
60  Genève-Aéroport               3.6                3.64                369              363        57.30
61  Yverdon-les-Bains             3.3                3.13               1151              619        60.74
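
The leaderboard also shows that scraping coverage is uneven: small stations are captured almost completely, while the busiest ones are only partly sampled. The sketch below computes the scraped share for four rows copied from the table. The variable names are illustrative.

```python
# (station, reviews on Google, reviews scraped) copied from the leaderboard above
rows = [
    ("Luzern", 7155, 2243),
    ("Zürich HB", 5250, 2223),
    ("Thun", 378, 359),
    ("Rorschach Hafen", 103, 103),
]

# Share of a station's Google reviews that ended up in the scraped dataset
coverage = {name: scraped / total for name, total, scraped in rows}

for name, share in sorted(coverage.items(), key=lambda kv: kv[1]):
    print(f"{name:<16} {share:6.1%}")
```

Stations with low coverage also tend to show a very high "% with Text" (for example Zürich HB and Luzern), which may indicate that the scraped sample for large stations leans towards written reviews. This is worth keeping in mind when comparing large and small stations.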

Most Controversial Stations

To find the most controversial cases, let’s look at stations with the highest standard deviation in ratings, where some reviewers give 5 stars and others give 1. Only stations with at least 30 reviews are included to filter out statistical noise.

Show code
MIN_REVIEWS = MIN_REVIEWS_PER_STATION  # shared threshold from Setup (30): too few reviews give unstable variance

controversy = (
    reviews.groupby('opuic')
    .agg(std=('rating', 'std'), mean=('rating', 'mean'), count=('rating', 'count'))
    .reset_index()
    .merge(done[['opuic', 'name', 'overall_rating']], on='opuic', how='inner')
)
controversy = controversy[controversy['count'] >= MIN_REVIEWS].copy()
controversy = controversy.sort_values('std', ascending=False).reset_index(drop=True)
controversy['std']  = controversy['std'].round(2)
controversy['mean'] = controversy['mean'].round(2)

top_n = 15

# Explicit order for both charts: ascending std so most controversial appears at top
order15 = controversy.head(top_n).sort_values('std')['name'].tolist()
order10 = controversy.head(10).sort_values('std')['name'].tolist()

# Bar chart: std dev coloured by mean rating
fig = px.bar(
    controversy.head(top_n),
    x='std', y='name',
    orientation='h',
    color='mean',
    color_continuous_scale=RATING_SCALE,
    range_color=[1, 5],
    text='std',
    category_orders={'name': order15},
    title=f'Top {top_n} Most Controversial Stations (highest rating std dev)',
    labels={'std': 'Std deviation', 'name': '', 'mean': 'Avg rating'},
)
fig.update_traces(textposition='outside')
fig.update_layout(
    coloraxis_colorbar_title='Avg rating',
    margin_r=60,
    height=480,
)
fig.show()

# Box plot: rating distribution for top 10 most controversial
top10_names = controversy.head(10)['name'].tolist()
plot_df = (
    reviews
    .merge(done[['opuic', 'name']], on='opuic', how='left')
    [lambda df: df['name'].isin(top10_names)]
)

print(f"Most controversial (≥{MIN_REVIEWS} reviews):")
print(controversy.head(10)[['name', 'overall_rating', 'mean', 'std', 'count']]
      .rename(columns={'overall_rating': 'google_rating', 'mean': 'avg_scraped', 'count': 'reviews'})
      .to_string(index=False))
print()
print("Most consistent:")
print(controversy.tail(5)[['name', 'overall_rating', 'mean', 'std', 'count']]
      .rename(columns={'overall_rating': 'google_rating', 'mean': 'avg_scraped', 'count': 'reviews'})
      .to_string(index=False))
Figure 6: Most polarising stations: highest standard deviation in scraped review ratings.
Most controversial (≥30 reviews):
             name  google_rating  avg_scraped  std  reviews
  Genève-Aéroport            3.6         3.64 1.51      363
        Oensingen            3.9         3.90 1.48       41
           Wil SG            3.8         3.81 1.47      187
       Langenthal            3.7         3.66 1.41       94
            Aarau            3.8         3.84 1.37      358
         Lenzburg            3.7         3.66 1.36      288
        Basel SBB            4.1         3.97 1.34     1323
            Olten            3.8         3.76 1.33      459
             Sion            4.1         4.05 1.33      221
Yverdon-les-Bains            3.3         3.13 1.32      619

Most consistent:
           name  google_rating  avg_scraped  std  reviews
           Thun            4.5         4.51 0.85      359
       Montreux            4.5         4.52 0.84      338
  Rapperswil SG            4.6         4.57 0.83      212
           Nyon            4.5         4.52 0.81      243
Rorschach Hafen            4.7         4.68 0.56      103

The most controversial stations cluster around 3.3–4.1 stars. This is solidly mid-range, but their high standard deviations reveal split opinions rather than universal mediocrity. These are stations where some travellers have a perfectly fine experience while others are frustrated, likely driven by specific pain points (safety, cleanliness, crowds) that don’t affect everyone equally.

The most consistent stations, by contrast, are all top-rated (4.5+). This suggests that genuinely good stations leave little room for disagreement: consistency and quality go hand in hand. No station is consistently bad. Stations are either consistently forgettable or consistently good.
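
To illustrate why the standard deviation is a useful proxy for controversy, the toy example below compares two hypothetical stations with the same average rating. The ratings are invented for illustration. `statistics.stdev` computes the sample standard deviation, which matches the default of the pandas `std` aggregation used above.

```python
from statistics import mean, stdev

# Hypothetical star ratings for two stations with the same average
polarised = [1, 5, 1, 5, 5, 1, 5, 5]   # reviewers love it or hate it
consensus = [3, 4, 3, 4, 4, 3, 4, 3]   # everyone finds it fine

for label, ratings in [("polarised", polarised), ("consensus", consensus)]:
    print(f"{label:<10} mean={mean(ratings):.2f}  std={stdev(ratings):.2f}")
```

Both stations average 3.5 stars, but only the standard deviation reveals that the first one splits opinion.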

Tip: Ratings at a glance

The network mean is 4.16 stars with most stations close to it. The interesting cases are the polarising stations (Oensingen, Wil SG, Genève-Aéroport), where reviewers actively disagree about the same place. For Oensingen and Wil SG, ratings have improved markedly over time, so the older, lower ratings inflate the spread. For Genève-Aéroport, many reviewers also rate the airport itself, which may blur the rating of the train station with that of the airport.

5: Language & Regional Patterns

Switzerland has four official languages. I look at whether station ratings differ by linguistic region, and how the mix of review languages varies across the country.

Rating by Linguistic Region

I group stations by linguistic region (German, French, Italian) and compare average ratings to see if travellers in one part of Switzerland are systematically harsher than in another. Romansh is not included as no analyzed station is within the language area.

Show code
# Use the linguistic_region column added to stations.csv
done['region'] = done['linguistic_region']

region_stats = done.groupby('region').agg(
    stations=('opuic','count'),
    avg_rating=('overall_rating','mean'),
    median_rating=('overall_rating','median'),
).reset_index()

print(region_stats.to_string(index=False))
          region  stations  avg_rating  median_rating
       Bilingual         2    4.100000            4.1
French (Romandy)        13    4.138462            4.2
          German        43    4.151163            4.2
Italian (Ticino)         3    4.400000            4.4
Figure 9

Station ratings are roughly consistent across linguistic regions, averaging 4.1-4.4 stars. The small differences between regions are unlikely to be statistically meaningful given the per-region sample sizes (n=2 to n=43 stations).
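
A back-of-the-envelope check supports this reading. The sketch below compares the largest gap in the table (Ticino vs the German-speaking region) with its approximate standard error. It assumes that the spread of station ratings within each region is similar to the network-wide standard deviation of 0.28 reported earlier. That is an assumption, not something measured per region.

```python
from math import sqrt

# Assumption: within-region spread of station ratings equals the
# network-wide standard deviation reported earlier (0.28).
STD_ASSUMED = 0.28

# (mean rating, number of stations) from the regional table above
german = (4.151163, 43)
ticino = (4.400000, 3)


def standard_error(std, n):
    """Standard error of a mean computed from n stations."""
    return std / sqrt(n)


diff = ticino[0] - german[0]
se_diff = sqrt(standard_error(STD_ASSUMED, german[1]) ** 2
               + standard_error(STD_ASSUMED, ticino[1]) ** 2)
ratio = diff / se_diff
print(f"difference = {diff:.2f}, standard error = {se_diff:.2f}, ratio = {ratio:.1f}")
```

Under this assumption the gap is roughly 1.5 standard errors, below the common rule of thumb of 2, so with only three Ticino stations the difference could plausibly be noise.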

Language Mix by Station

Which stations attract the most multilingual review bases? I take the 20 most-reviewed stations and break down their written reviews by reviewer language. This is a proxy for which stations have the most international vs local traffic.

Show code
# Language mix for the 20 most-reviewed stations (written reviews only)
top20_opuic = done.nlargest(20, 'scraped_reviews')['opuic'].tolist()
top20_rev   = written[written['opuic'].isin(top20_opuic)].copy()
top20_names = done.set_index('opuic')['name'].to_dict()
top20_rev['station_name'] = top20_rev['opuic'].map(top20_names)

lang_mix = (
    top20_rev.groupby(['station_name','lang'])
    .size()
    .reset_index(name='count')
)

# Sort stations by total written review count
order = top20_rev.groupby('station_name').size().sort_values(ascending=True).index.tolist()

fig = px.bar(
    lang_mix, x='count', y='station_name',
    color='lang',
    orientation='h',
    category_orders={'station_name': order},
    color_discrete_map=LANG_COLORS,
    title='Review Language Mix: Top 20 Stations (written reviews only)',
    labels={'count': 'Number of reviews', 'station_name': '', 'lang': 'Language'},
    height=600,
)
fig.show()
Figure 10: Language mix at the 20 most-reviewed stations. Tourist hubs like Zürich HB and Genève-Aéroport skew toward English, while regional stations lean toward the local language.

Not surprisingly, stations frequented by tourists, such as Zürich HB and Luzern, have a large share of reviews in English. However, at every station a large share of reviews is still written in the local language.

Does the Reviewer’s Language Affect Ratings?

Another question that came up while looking at the data is whether the reviewer’s language influences ratings across the linguistic regions of Switzerland. For example, how does someone from a Swiss German canton rate train stations in Romandy?

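For simplicity, reviews written in a language that is not a Swiss national language are treated as written by tourists. Note that, as implemented here, the 'Tourist' group also absorbs every review without a detected language, including the 7,838 rating-only reviews, so it is an upper bound on tourist volume rather than a clean sample of visitors.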
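
A stricter grouping would keep reviews with no detected language apart from genuine foreign-language reviews. The sketch below is one possible variant, not the mapping used for the figures in this section. The function name and the 'Unknown' label are illustrative.

```python
SWISS_LABELS = {"de": "German", "fr": "French", "it": "Italian", "rm": "Romansh"}


def reviewer_group(language):
    """Map a detected language code to a reviewer group.

    Missing codes (None or NaN) become 'Unknown' instead of 'Tourist'.
    """
    if not isinstance(language, str) or not language:
        return "Unknown"
    return SWISS_LABELS.get(language, "Tourist")


# Hypothetical language codes, including missing values
codes = ["de", "en", None, float("nan"), "ko", "fr"]
groups = [reviewer_group(c) for c in codes]
print(groups)
```

In the notebook this could be applied with `reviews['language'].apply(reviewer_group)`, after which the 'Unknown' rows can be dropped or reported separately.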

Show code
# ── Rating distribution by reviewer language ─────────────────────────────────
SWISS_LANGS = {'de', 'fr', 'it', 'rm'}
GROUP_LABELS = {'de': 'German', 'fr': 'French', 'it': 'Italian', 'rm': 'Romansh'}

rev_lang = reviews.copy()
rev_lang['reviewer_label'] = rev_lang['language'].apply(
    lambda l: GROUP_LABELS.get(l, 'Tourist')
)

order = ['German', 'French', 'Italian', 'Tourist']
plot_df = rev_lang[rev_lang['reviewer_label'].isin(order)]

fig = make_subplots(rows=1, cols=4, subplot_titles=order, shared_yaxes=True)

for col, label in enumerate(order, 1):
    subset = plot_df[plot_df['reviewer_label'] == label]
    counts = subset['rating'].value_counts().reindex([1,2,3,4,5], fill_value=0)
    pcts = (counts / counts.sum() * 100).round(1)
    fig.add_trace(
        go.Bar(
            x=pcts.index, y=pcts.values,
            marker_color=[RATING_SCALE[r-1] for r in pcts.index],
            showlegend=False,
            text=[f'{v:.0f}%' for v in pcts.values], textposition='outside',
        ),
        row=1, col=col,
    )
    fig.update_xaxes(tickvals=[1,2,3,4,5], title_text='Stars', row=1, col=col)

fig.update_yaxes(title_text='% of reviews', range=[0, 65], row=1, col=1)
fig.update_layout(
    title='Rating Distribution by Reviewer Language (% of group)',
    height=350,
)
fig.show()

print(plot_df.groupby('reviewer_label')['rating'].agg(['mean','median','count']).loc[order].round(2))
Figure 11: Rating distribution by reviewer language.
                mean  median  count
reviewer_label                     
German          3.88     4.0   5549
French          3.69     4.0   1727
Italian         4.25     5.0    577
Tourist         4.31     5.0  14768

Interestingly, the Tourist group gives the highest ratings on average. The harshest ratings come from French-speaking reviewers, followed by German-speaking reviewers. This raises a follow-up question: do people rate stations the same way in a region that does not speak their primary language? For example, does a German speaker rate stations in the German-speaking region the same way as stations in another language region of Switzerland?

Show code
# Cross-regional bias heatmap: reviewer language × station region
rev_region = reviews.merge(done[['opuic','linguistic_region']], on='opuic', how='left')
rev_region['reviewer_label'] = rev_region['language'].apply(
    lambda l: {'de':'German','fr':'French','it':'Italian'}.get(l, 'Tourist')
)

pivot = (
    rev_region
    .groupby(['reviewer_label','linguistic_region'])['rating']
    .mean()
    .unstack()
)

row_order = ['German','French','Italian','Tourist']
col_order = ['German','French (Romandy)','Italian (Ticino)','Bilingual']
pivot = pivot.reindex(
    index=[r for r in row_order if r in pivot.index],
    columns=[c for c in col_order if c in pivot.columns],
)

fig = px.imshow(
    pivot,
    color_continuous_scale='RdYlGn',
    range_color=[3.5, 5.0],
    text_auto='.2f',
    title='Average Rating: Reviewer Language × Station Linguistic Region',
    labels={'x': 'Station region', 'y': 'Reviewer language', 'color': 'Avg rating'},
    aspect='auto',
)
fig.update_layout(coloraxis_colorbar_title='Avg rating', height=350)
fig.show()
Figure 12: Reviewer language by station region: which combinations are harshest.

Reviewers are harshest in their home region. German speakers rate the Swiss German-region stations lowest (3.86), French speakers rate Romandy stations lowest (3.53), and Italian speakers reserve their lowest scores for Ticino (4.00). Even the bilingual regions of Biel/Bienne and Fribourg/Freiburg follow this pattern. This likely reflects the familiarity effect. Daily commuters notice every flaw, while visitors passing through tend to rate the overall experience more generously.

Tip: Language and region at a glance

Linguistic regions rate similarly on average, but reviewers are harshest in their home region. Reviewers writing in a language other than German, French, or Italian, treated here as tourists, give the most generous ratings overall.

6: What Travellers Say

Star ratings are useful, but much of the insight sits in the review text. I look at the most common word pairs (bigrams) used by happy vs unhappy reviewers (with stop words removed), identify the most common topics with a keyword-based pass, and then move to an AI classifier that, in addition to detecting the topic, detects the sentiment and summarises the gist of the issue.
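
One detail helps when reading the bigram charts: stop words are removed before word pairs are formed, so a bigram can bridge a removed word ("easy to navigate" becomes "easy navigate"). The standard-library sketch below mimics that behaviour on two invented reviews. The tiny stop-word list and the function name are illustrative, and the analysis itself relies on `CountVectorizer`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the analysis itself uses NLTK's lists
STOP_WORDS = {"the", "is", "to", "and", "a", "very"}


def bigram_counts(texts, stop_words=STOP_WORDS):
    """Count adjacent word pairs after lower-casing and removing stop words."""
    counts = Counter()
    for text in texts:
        # Tokens of two or more word characters, stop words filtered out first
        tokens = [t for t in re.findall(r"\w\w+", text.lower()) if t not in stop_words]
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical reviews
reviews_sample = [
    "The station is easy to navigate",
    "Very easy to navigate and clean",
]
counts = bigram_counts(reviews_sample)
print(counts.most_common(1))
```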

Show code
# Build multilingual stopword set (NLTK base + domain-specific terms)
STOP = set()
for lang in ['german', 'french', 'english', 'italian']:
    STOP.update(stopwords.words(lang))

# Domain-specific terms not covered by NLTK
STOP.update([
    'bahnhof', 'gare', 'stazione', 'station', 'train', 'zug', 'bahn',
    'treno', 'sbb', 'good', 'great', 'nice', 'well', 'really',
    'place', 'très', 'molto', 'sehr',
])

negative = reviews[reviews['rating'] <= 2]['text']
Show code
# Top bigrams per rating tier
def top_ngrams(text_series, n=2, top_k=15):
    corpus = text_series.dropna().str.lower().tolist()
    vec = CountVectorizer(ngram_range=(n,n), stop_words=list(STOP), min_df=2)
    X   = vec.fit_transform(corpus)
    counts = X.sum(axis=0).A1
    terms  = vec.get_feature_names_out()
    return pd.Series(counts, index=terms).nlargest(top_k)

tiers = {
    '1–2 ★': reviews[reviews['rating'] <= 2]['text'],
    '3 ★':   reviews[reviews['rating'] == 3]['text'],
    '4–5 ★': reviews[reviews['rating'] >= 4]['text'],
}

fig = make_subplots(rows=1, cols=3, subplot_titles=list(tiers.keys()),
                    horizontal_spacing=0.12)
colors = ['#d73027', '#fdae61', '#1a9641']

tier_ngrams = {}
for col, (tier, series), color in zip(range(1,4), tiers.items(), colors):
    ng = top_ngrams(series, n=2, top_k=12)
    tier_ngrams[tier] = ng
    fig.add_trace(
        go.Bar(x=ng.values, y=ng.index, orientation='h',
               marker_color=color, showlegend=False),
        row=1, col=col,
    )

fig.update_layout(height=450, width=1100, title_text='Most Common Bigrams by Rating Tier')
fig.show()
Figure 13: Most common bigrams per rating tier.

1-2 star reviews focus on concrete problems: unsafe atmosphere (“mal fréquenté/fréquentée”), overcrowding (“immer mehr”, “beaucoup trop”), and specific stations that face problems, such as Zürich HB. Passport control and 1st class issues also surface.

3 star reviews are ambivalent. Stations “fulfil their purpose” and are “ganz ok”, with shops and good connections mentioned, but recurring issues dampen the experience (“schöner, leider…”, “leider oft”, “seit Jahren”). “Nothing special” is the single most common phrase.

4-5 star reviews show a clear pattern: satisfied reviewers frequently rate the city rather than the station (“schöne Stadt”, “beautiful city”, “belle ville”, “old town”). Practical qualities like navigation (“easy navigate”), connections, and shopping also feature prominently. The much higher counts in this tier reflect both the larger volume of positive reviews and the more consistent vocabulary happy reviewers use.

Tip: What the positive bigrams are about

Many of the positive bigrams (“beautiful city”, “great place”, “must visit”) are not really about the station at all. Some stations sit inside or right next to a famous destination (Genève-Aéroport, Lausanne, Lugano, Locarno), and reviewers often praise the city or the airport rather than the platform, signage, or facilities. Their 5-star review is honest, but it inflates the station’s rating for reasons SBB cannot directly influence.

Show code
# Review length vs star rating
written['text_len'] = written['text'].str.len()

fig = px.box(
    written, x='rating', y='text_len',
    color='rating',
    color_discrete_sequence=RATING_SCALE,
    title='Review Length by Star Rating (written reviews only)',
    labels={'text_len': 'Characters', 'rating': 'Star rating'},
    category_orders={'rating': [1,2,3,4,5]},
)
fig.update_layout(showlegend=False)
fig.show()

print(written.groupby('rating')['text_len'].median().rename('median_chars').astype(int))
Figure 14: Character count of reviews by star rating. Unhappy travellers write substantially more: 1-star reviews have a median of 124 characters, vs 49 for 5-star reviews.
rating
1    124
2     99
3     63
4     59
5     49
Name: median_chars, dtype: int64

Observation: even though the review text is cut off after a certain length (~340 characters), the pattern still shows that lower-rated reviews tend to be more wordy. Unhappy travellers have more to say.
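Whether the cut-off could distort the medians can be checked directly: a group's median is unaffected by truncation as long as fewer than half of its reviews reach the cap. The following is a minimal sketch, assuming a frame with the `rating` and `text_len` columns used above. The function name `share_at_cap`, the exact cap of 340, and the toy data are assumptions for illustration only.

```python
import pandas as pd

def share_at_cap(df: pd.DataFrame, cap: int = 340) -> pd.Series:
    """Share of reviews per star rating whose length reaches the assumed cap."""
    at_cap = df['text_len'] >= cap
    return at_cap.groupby(df['rating']).mean()

# Toy data (made up): two of four 1-star reviews and one of four 5-star reviews hit the cap
toy = pd.DataFrame({
    'rating':   [1, 1, 1, 1, 5, 5, 5, 5],
    'text_len': [340, 340, 120, 90, 340, 50, 40, 30],
})
shares = share_at_cap(toy)
print(shares)
```

If the share stays well below 0.5 for every rating, the medians reported above are the same as they would be on untruncated text. Means and upper quantiles, by contrast, would be biased downward.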

Complaint Analysis

Low-star reviews contain the most actionable signal. As a first pass, I use a naive keyword-anchored approach: I match reviews against keywords in four languages (German, French, Italian, and English) to bucket them into aspects (Cleanliness, Safety, Connections, etc.). This shows roughly what people talk about and where pain points concentrate, but it has clear limitations that I revisit further down with an AI-based approach.

# Aspect-based complaint analysis (keyword anchoring)
ASPECTS = {
    'Cleanliness':      ['clean','dirty','sauber','dreckig','schmutzig','propre','sale','müll','abfall','filth',
                         'geruch','stink','smell','odeur','hygiene','graffiti','ordentlich','sporco','pulito','immondizia','puzza','déchets'],
    'Toilets':          ['toilet','wc','restroom','bathroom','toilette','klo',
                         'geschlossen','closed','fermé','kostenpflichtig','pay','bagno','sanitär'],
    'Lifts/Escalators': ['lift','elevator','escalator','rolltreppe','aufzug','ascenseur','escalier roulant',
                         'defekt','broken','kaputt','out of order','hors service','treppe','stairs','escalier','ascensore'],
    'Food & Shops':     ['shop','restaurant','food','essen','kiosk','migros','coop','café','coffee','kaffee','snack',
                         'bar','bakery','bäckerei','supermarché','laden','bistro','takeaway'],
    'Safety':           ['safe','unsafe','sicher','unsicher','security','polizei','dunkel','dark','gefährlich',
                         'drug','drogen','drogue','droga','dealer','needle','nadel','seringue','siringa','junkie','süchtig','rauschgift',
                         'diebstahl','theft','vol','betrunken','drunk','ivre','ubriaco','aggressiv','aggression','belästigung','gewalt','pericoloso'],
    'Signage/Nav':      ['signage','confus','wegweiser','orient','übersicht','indication','panneau',
                         'schild','beschilderung','anzeigetafel','orientation','display','abfahrt','departures','unübersichtlich'],
    'Parking/Bikes':    ['parking','parkplatz','parkhaus','velo','fahrrad','bike','vélo',
                         'e-bike','velostall','fahrradständer','gestohlen','stolen','moto'],
    'Connections':      ['connection','anschluss','verspätung','delay','pünktlich','correspondance','retard',
                         'missed','verpasst','ausfall','cancel','gleis','platform','voie','binario','fahrplan','horaire','ritardo'],
    'Crowds':           ['crowd','overcrowd','voll','überfüllt','bondé','busy',
                         'gedränge','rush hour','stosszeit','queue','warteschlange','heures de pointe'],
    'Accessibility':    ['wheelchair','rollstuhl','handicap','barrier','barriere','accessible','behinderung',
                         'ramp','rampe','kinderwagen','stroller','poussette','senior','elderly','blind'],
    'Seating/Waiting':  ['bench','seat','sitz','sitzplatz','banc','panchina','waiting area','warteplatz',
                         'warteraum',"salle d'attente","sala d'aspetto"],
    'Staff/Service':    ['staff','personal','mitarbeiter','freundlich','rude','helpful','hilfe','personnel',
                         'service','unhelpful','unfreundlich','scortese','aimable','impoli'],
}
Figure 15
Show code
all_stars = reviews['text'].dropna().str.lower()

aspect_counts = {
    asp: all_stars.str.contains('|'.join(kws), regex=True).sum()
    for asp, kws in ASPECTS.items()
}
aspect_df = (
    pd.Series(aspect_counts)
    .sort_values()
    .reset_index()
    .rename(columns={'index':'aspect', 0:'mentions'})
)
aspect_df.columns = ['aspect','mentions']
aspect_df['pct'] = (aspect_df['mentions'] / len(all_stars) * 100).round(1)

fig = px.bar(
    aspect_df, x='mentions', y='aspect',
    orientation='h',
    text=aspect_df['pct'].map('{:.1f}%'.format),
    title=f'Most Mentioned Topics in all Reviews ({len(all_stars):,} reviews)',
    labels={'mentions': 'Reviews mentioning topic', 'aspect': ''},
    color='aspect',
    color_discrete_map=ASPECT_COLORS,
)
fig.update_traces(textposition='outside')
fig.update_layout(showlegend=False, margin_r=80, height=500)
fig.show()

Food & Shops is the most discussed aspect (12.4%), followed by Cleanliness (8.4%) and Connections (5.5%). Together, these three account for over a quarter of all reviews. The lower bars (Toilets, Safety, Crowds, etc.) are sparser but more diagnostic, since they almost always indicate problems rather than praise. To focus on what is being talked about in negative reviews, I restrict the next chart to low-rated reviews only.

Show code
# Aspect mentions in 1-3 star reviews only
low_star_12 = reviews[reviews['rating'] <= 3]['text'].dropna().str.lower()

aspect_counts_12 = {
    asp: low_star_12.str.contains('|'.join(kws), regex=True).sum()
    for asp, kws in ASPECTS.items()
}
aspect_df_12 = (
    pd.Series(aspect_counts_12)
    .sort_values()
    .reset_index()
)
aspect_df_12.columns = ['aspect', 'mentions']
aspect_df_12['pct'] = (aspect_df_12['mentions'] / len(low_star_12) * 100).round(1)

fig = px.bar(
    aspect_df_12, x='mentions', y='aspect',
    orientation='h',
    text=aspect_df_12['pct'].map('{:.1f}%'.format),
    title=f'Most Mentioned Complaint Topics in 1-3 Star Reviews ({len(low_star_12):,} reviews)',
    labels={'mentions': 'Reviews mentioning topic', 'aspect': ''},
    color='aspect',
    color_discrete_map=ASPECT_COLORS,
)
fig.update_traces(textposition='outside')
fig.update_layout(showlegend=False, margin_r=80, height=500)
fig.show()
Figure 16: Aspect mentions in 1-3 star reviews only.
Show code
# Per-station dominant complaint topic (stations rated <= 4.5, >= 5 written low-star reviews)
low_rev = reviews[reviews['rating'] <= 3].merge(done[['opuic','name','overall_rating']], on='opuic', how='left').copy()
low_rev = low_rev[low_rev['overall_rating'] <= 4.5]
low_rev['text_lower'] = low_rev['text'].str.lower()

rows = []
for opuic, grp in low_rev.groupby('opuic'):
    txt = grp['text_lower'].dropna()
    if len(txt) < 5:
        continue
    scores = {asp: txt.str.contains('|'.join(kws), regex=True).sum() for asp, kws in ASPECTS.items()}
    top_asp = max(scores, key=scores.get)
    rows.append({
        'Station': grp['name'].iloc[0],
        'Rating': grp['overall_rating'].iloc[0],
        'Top complaint': top_asp,
        'Mentions': scores[top_asp],
        'Reviews analysed': len(txt),
        '% mentioning of written station reviews': round(scores[top_asp] / len(txt) * 100, 1),
    })

pain_df = (
    pd.DataFrame(rows)
    .sort_values('Rating', ascending=True)
    .reset_index(drop=True)
)
pain_df.index += 1
pain_df
Station Rating Top complaint Mentions Reviews analysed % mentioning of written station reviews
1 Yverdon-les-Bains 3.3 Safety 31 231 13.4
2 Genève-Aéroport 3.6 Safety 18 104 17.3
3 Langenthal 3.7 Connections 5 21 23.8
4 Lenzburg 3.7 Crowds 12 58 20.7
5 Glattbrugg 3.8 Cleanliness 1 11 9.1
6 Bern Europaplatz 3.8 Food & Shops 1 5 20.0
7 Wil SG 3.8 Safety 11 43 25.6
8 Aarau 3.8 Safety 17 84 20.2
9 Bern Wankdorf 3.8 Lifts/Escalators 4 13 30.8
10 Olten 3.8 Connections 15 91 16.5
11 Solothurn 3.9 Safety 9 81 11.1
12 Oensingen 3.9 Toilets 2 5 40.0
13 Uster 3.9 Food & Shops 11 52 21.2
14 Zürich Hardbrücke 3.9 Cleanliness 7 53 13.2
15 Brugg AG 3.9 Cleanliness 7 31 22.6
16 Lausanne 4.0 Connections 37 219 16.9
17 La Chaux-de-Fonds 4.1 Staff/Service 2 17 11.8
18 Landquart 4.1 Signage/Nav 1 7 14.3
19 Vevey 4.1 Safety 8 29 27.6
20 Kreuzlingen 4.1 Safety 2 9 22.2
21 Frauenfeld 4.1 Toilets 3 14 21.4
22 Biel/Bienne 4.1 Staff/Service 9 67 13.4
23 Fribourg/Freiburg 4.1 Safety 8 51 15.7
24 Baden 4.1 Food & Shops 9 85 10.6
25 Bülach 4.1 Cleanliness 2 10 20.0
26 Basel SBB 4.1 Toilets 40 294 13.6
27 Sion 4.1 Staff/Service 3 28 10.7
28 Visp 4.1 Staff/Service 5 39 12.8
29 Rotkreuz 4.1 Cleanliness 2 9 22.2
30 Sargans 4.2 Cleanliness 4 16 25.0
31 Delémont 4.2 Toilets 1 5 20.0
32 Burgdorf 4.2 Safety 2 5 40.0
33 Winterthur 4.2 Cleanliness 8 61 13.1
34 Genève 4.2 Toilets 20 128 15.6
35 Zürich Stadelhofen 4.2 Connections 9 76 11.8
36 Zürich Enge 4.2 Cleanliness 2 12 16.7
37 Zürich Oerlikon 4.2 Food & Shops 19 198 9.6
38 Bellinzona 4.3 Toilets 4 32 12.5
39 Martigny 4.3 Toilets 1 9 11.1
40 Neuchâtel 4.3 Toilets 2 20 10.0
41 Zug 4.3 Toilets 8 39 20.5
42 St. Gallen 4.3 Toilets 12 54 22.2
43 Schaffhausen 4.3 Toilets 4 28 14.3
44 Chur 4.3 Safety 11 49 22.4
45 Brig 4.4 Food & Shops 4 14 28.6
46 Luzern 4.4 Food & Shops 42 274 15.3
47 Arth-Goldau 4.4 Cleanliness 3 22 13.6
48 Lugano 4.4 Toilets 6 67 9.0
49 Zürich HB 4.4 Cleanliness 47 335 14.0
50 Montreux 4.5 Staff/Service 3 21 14.3
51 Nyon 4.5 Cleanliness 2 11 18.2
52 Bern 4.5 Cleanliness 6 83 7.2
53 Thun 4.5 Connections 3 22 13.6
54 Locarno 4.5 Signage/Nav 2 11 18.2
Figure 17: Dominant complaint topic per station (keyword anchoring). Sorted by rating of the station.
Warning: Limitations of the keyword-anchored approach

The keyword-based approach is a useful first pass, but it has clear limitations:

  1. It cannot distinguish sentiment. “clean” matches both “very clean station” and “not clean at all”. A 3-star review saying “the food was great but connections were terrible” gets counted under both Food & Shops and Connections as complaints, even though only one is negative.
  2. It double-counts multi-aspect reviews. One review mentioning “toilet” and “dirty” gets counted under both Toilets and Cleanliness, inflating totals.
  3. Some keywords are ambiguous. “bar” (Food & Shops) matches “bar” in other contexts. “closed” (Toilets) matches anything being closed. “dark” (Safety) could describe atmosphere or aesthetics.
  4. It misses complaints phrased without keywords. “I had to wait 45 minutes for the next train” has no Connections keyword.
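These limitations are easy to reproduce on a handful of sentences. The sketch below uses a deliberately tiny keyword map (`MINI_ASPECTS`, a hypothetical subset of the full `ASPECTS` dictionary) and a helper `match_aspects` that is not part of the project code. It compares plain substring matching, as used above, with a whole-word variant.

```python
import re

# Hypothetical mini version of the ASPECTS dictionary, for illustration only
MINI_ASPECTS = {
    'Cleanliness':  ['clean', 'dirty'],
    'Toilets':      ['toilet', 'closed'],
    'Food & Shops': ['food', 'bar'],
    'Connections':  ['connection', 'delay'],
}

def match_aspects(text: str, whole_words: bool = False) -> list[str]:
    """Return every aspect whose keywords occur in the text."""
    hits = []
    for aspect, kws in MINI_ASPECTS.items():
        pattern = '|'.join(re.escape(k) for k in kws)
        if whole_words:
            # Only match keywords that stand alone as complete words
            pattern = rf'\b(?:{pattern})\b'
        if re.search(pattern, text.lower()):
            hits.append(aspect)
    return hits

examples = [
    'Not clean at all',                             # negation is invisible to keywords
    'The ticket office was closed',                 # "closed" is filed under Toilets
    'No barrier-free access to platform 3',         # "bar" matches inside "barrier"
    'I had to wait 45 minutes for the next train',  # complaint without any keyword
]
for text in examples:
    print(f'{text!r}: substring={match_aspects(text)}, '
          f'whole-word={match_aspects(text, whole_words=True)}')
```

Whole-word matching removes the "bar" inside "barrier" false positive, but it is not a free fix: prefix-style keywords such as `confus` or `orient` in the full dictionary rely on substring matching to catch inflected forms, and neither variant can see negation or sentiment.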

Today’s AI models are reasonably good at extracting aspect-level sentiment, so in the next section I use an AI model to classify the sentiment of each review and the topics (aspects) it talks about.

AI-Powered Aspect Classification

The keyword approach above cannot distinguish a complaint from a compliment. To fix this I run every review through GPT-4o-mini (see src/sbb_reviews/analysis/classify_aspects.py) with a structured prompt that asks the model to return, for each review:

  • which aspect is mentioned (constrained to a fixed list of 12: Cleanliness, Toilets, Lifts/Escalators, Food & Shops, Safety, Signage/Nav, Parking/Bikes, Connections, Crowds, Accessibility, Seating/Waiting, Staff/Service)
  • whether it is a complaint, praise, or neutral observation
  • a concrete reason in the model’s own words (e.g. “drug addicts loitering near entrance” rather than just Safety)
  • the exact phrases from the review text that support the classification

Reviews are processed in batches of 10. The fixed aspect list should prevent the model from inventing categories, and the requirement to quote supporting phrases makes spot-checking easy.

Run sbb-reviews classify once to generate data/derived/complaint_aspects_ai.csv, then execute the cells below to explore the results.

Throughout the rest of this section I limit each aspect-level chart to aspects with at least MIN_COMPLAINTS_PER_ASPECT = 10 AI-classified negative mentions across the network. Aspects with fewer mentions are too sparse to support reliable comparisons of severity or volume, so they are dropped from the upcoming visualizations.

Approximately 2-3% of reviews could not be classified because the model’s batch response exceeded the token limit and was truncated mid-output. Those batches were skipped rather than partially saved. I spot-checked roughly 50 random classifications and found the aspect labels and sentiments to be consistent with the source text. The results are representative but not exhaustive.
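The safeguards described above (fixed aspect list, quoted supporting phrases, batches of 10, skipping truncated responses) can be sketched as a small validation step. This is not the code in `classify_aspects.py`. The JSON field names `review_index`, `reason`, and `phrases`, and the helpers `batched` and `validate_batch`, are assumptions for illustration, and no model call is made here.

```python
import json

ALLOWED_ASPECTS = {
    'Cleanliness', 'Toilets', 'Lifts/Escalators', 'Food & Shops', 'Safety',
    'Signage/Nav', 'Parking/Bikes', 'Connections', 'Crowds', 'Accessibility',
    'Seating/Waiting', 'Staff/Service',
}
ALLOWED_SENTIMENTS = {'positive', 'negative', 'neutral'}

def batched(items: list, size: int = 10):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def validate_batch(raw_response: str, review_texts: list[str]):
    """Return the valid mentions of one batch, or None if the response is unusable."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # truncated or malformed output: skip the whole batch
    if not isinstance(items, list):
        return None
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        idx = item.get('review_index')
        if not isinstance(idx, int) or not 0 <= idx < len(review_texts):
            continue
        if item.get('aspect') not in ALLOWED_ASPECTS:
            continue  # the model invented a category
        if item.get('sentiment') not in ALLOWED_SENTIMENTS:
            continue
        text = review_texts[idx].lower()
        phrases = item.get('phrases') or []
        if not phrases or not all(p.lower() in text for p in phrases):
            continue  # quoted evidence must appear verbatim in the review
        kept.append(item)
    return kept

reviews_batch = [
    'Drug addicts loitering near the entrance, felt unsafe.',
    'Clean station, great bakery.',
]
good = json.dumps([
    {'review_index': 0, 'aspect': 'Safety', 'sentiment': 'negative',
     'reason': 'drug addicts loitering near entrance', 'phrases': ['felt unsafe']},
    {'review_index': 1, 'aspect': 'Ambience', 'sentiment': 'positive',
     'reason': 'pleasant station', 'phrases': ['Clean station']},
])
valid = validate_batch(good, reviews_batch)           # keeps Safety, drops "Ambience"
skipped = validate_batch(good[:60], reviews_batch)    # simulated truncation
print(valid, skipped)
```

Checking quoted phrases against the source text is what makes spot-checking cheap: a mention whose evidence cannot be found in the review is dropped automatically instead of having to be caught by eye.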

Show code
# Load AI classification results
AI_CSV = pathlib.Path('../data/derived/complaint_aspects_ai.csv')

if not AI_CSV.exists():
    print("complaint_aspects_ai.csv not found in data/derived/ -- run `sbb-reviews classify` first.")
    print("Run: python analysis/run_ai_classification.py")
else:
    ai = pd.read_csv(AI_CSV)
    print(f"Rows loaded: {len(ai):,}")
    print(f"Reviews: {ai.drop_duplicates(subset=['station_name','date_estimated','text']).shape[0]:,}")
    print()
    print("Sentiment breakdown:")
    print(ai['sentiment'].value_counts().to_string())
    print()
    print(f"Aspect breakdown (all sentiments with count >= {MIN_COMPLAINTS_PER_ASPECT}):")
    print(ai['aspect'].value_counts()[lambda x: x >= MIN_COMPLAINTS_PER_ASPECT].to_string())
Rows loaded: 11,857
Reviews: 8,813

Sentiment breakdown:
sentiment
positive    7354
negative    3430
neutral     1073

Aspect breakdown (all sentiments with count >= 10):
aspect
Food & Shops        2852
Connections         1833
Cleanliness         1710
Signage/Nav         1135
Staff/Service        998
Crowds               874
Safety               740
Accessibility        545
Toilets              436
Seating/Waiting      352
Parking/Bikes        196
Lifts/Escalators     114

Sanity check: Keyword vs AI

To see how much the keyword-based and AI-based approaches differ, I compare the per-aspect share of complaints under the keyword approach (low-star reviews only) with the share under the AI approach (negative-sentiment mentions only).

Show code
# Keyword vs AI: per-aspect share of complaints
if 'ai' in dir() and len(ai):
    # Keyword counts in low-star (<=3) reviews
    low_kw = reviews[reviews['rating'] <= 3]['text'].dropna().str.lower()
    kw_counts = {asp: low_kw.str.contains('|'.join(kws), regex=True).sum() for asp, kws in ASPECTS.items()}
    kw_df = pd.DataFrame({'aspect': list(kw_counts.keys()), 'count': list(kw_counts.values())})
    kw_df['method'] = 'Keyword (low-star)'
    kw_df['pct'] = kw_df['count'] / kw_df['count'].sum() * 100

    # AI counts in negative-sentiment mentions
    ai_neg = ai[ai['sentiment'] == 'negative']
    ai_df = ai_neg['aspect'].value_counts().reset_index()
    ai_df.columns = ['aspect', 'count']
    # Apply the section-wide aspect floor
    ai_df = ai_df[ai_df['count'] >= MIN_COMPLAINTS_PER_ASPECT]
    kw_df = kw_df[kw_df['aspect'].isin(ai_df['aspect'])]
    ai_df['method'] = 'AI (negative sentiment)'
    ai_df['pct'] = ai_df['count'] / ai_df['count'].sum() * 100

    compare = pd.concat([kw_df, ai_df], ignore_index=True)
    # Order by total share for a clean visual
    order = compare.groupby('aspect')['pct'].sum().sort_values(ascending=False).index.tolist()

    fig = px.bar(
        compare, x='aspect', y='pct', color='method', barmode='group',
        category_orders={'aspect': order},
        color_discrete_map={'Keyword (low-star)': '#9e9e9e', 'AI (negative sentiment)': '#1976d2'},
        title='Per-aspect share of complaints: keyword vs AI',
        labels={'pct': '% of method\'s total complaints', 'aspect': ''},
    )
    fig.update_layout(xaxis_tickangle=-30, legend_title='', height=450)
    fig.show()

    # Print the diff so the magnitudes are explicit
    pivot = compare.pivot(index='aspect', columns='method', values='pct').fillna(0).round(1)
    pivot['diff (AI - KW)'] = (pivot['AI (negative sentiment)'] - pivot['Keyword (low-star)']).round(1)
    print('Per-aspect share (%):')
    print(pivot.sort_values('AI (negative sentiment)', ascending=False).to_string())
Figure 18: Per-aspect share of complaints: keyword approach vs AI classifier.
Per-aspect share (%):
method            AI (negative sentiment)  Keyword (low-star)  diff (AI - KW)
aspect                                                                       
Safety                               16.8                12.6             4.2
Crowds                               12.9                 6.3             6.6
Staff/Service                        11.1                10.0             1.1
Signage/Nav                          10.2                 6.0             4.2
Toilets                              10.0                12.5            -2.5
Cleanliness                           9.4                13.1            -3.7
Connections                           8.4                13.5            -5.1
Food & Shops                          7.0                13.3            -6.3
Seating/Waiting                       5.7                 4.3             1.4
Accessibility                         4.4                 2.2             2.2
Parking/Bikes                         2.6                 2.8            -0.2
Lifts/Escalators                      1.7                 3.3            -1.6

The two methods rank the top aspects differently. The AI puts Safety and Crowds first, while the keyword approach puts Connections, Food & Shops, and Cleanliness on top, because their keywords also match positive mentions inside low-star reviews (“great food”, “good connections”), which inflates those counts. The AI also assigns a larger share to Crowds and Signage/Nav, plausibly because these complaints are often phrased without the obvious keywords. The lists still overlap on several topics (Safety, Toilets, Cleanliness, Staff/Service).

Show code
# Complaint counts per aspect (negative only)
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    aspect_counts = neg['aspect'].value_counts().reset_index()
    aspect_counts.columns = ['aspect', 'complaints']
    aspect_counts = aspect_counts[aspect_counts['complaints'] >= MIN_COMPLAINTS_PER_ASPECT]

    fig = px.bar(
        aspect_counts, x='aspect', y='complaints',
        color='aspect', color_discrete_map=ASPECT_COLORS,
        title=f'Complaint topics: AI classification ({len(neg):,} negative mentions)',
        labels={'complaints': 'Negative mentions', 'aspect': ''},
    )
    fig.update_layout(showlegend=False, xaxis_tickangle=-45, height=500)
    fig.show()
Figure 19: Total complaint counts per aspect, AI-classified negative mentions only.

Total Negative Mentions and Impact per Aspect

Show code
# Aspect severity ranking: which complaints are associated with the lowest ratings?
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    severity = (
        neg.groupby('aspect')
        .agg(
            avg_rating=('rating', 'mean'),
            median_rating=('rating', 'median'),
            complaint_count=('rating', 'count'),
        )
        .sort_values('avg_rating')
        .reset_index()
    )
    severity = severity[severity['complaint_count'] >= MIN_COMPLAINTS_PER_ASPECT].reset_index(drop=True)
    severity.index += 1
    severity.index.name = 'severity_rank'
    severity['avg_rating'] = severity['avg_rating'].round(2)
    severity['label'] = severity.apply(lambda r: f"{r['avg_rating']:.2f} (n={int(r['complaint_count'])})", axis=1)

    print("Aspect severity ranking (lowest avg rating = most damaging):")
    display(severity)

    fig = px.bar(
        severity.sort_values('avg_rating'),
        x='aspect', y='avg_rating',
        color='avg_rating',
        color_continuous_scale='RdYlGn',
        title='Aspect Severity: Average Review Rating per Negative Aspect',
        labels={'avg_rating': 'Avg rating of reviews', 'aspect': ''},
        text='label',
    )
    fig.update_layout(
        coloraxis_showscale=False,
        xaxis_tickangle=-45,
        yaxis_range=[0, 5],
        height=500,
    )
    fig.update_traces(textposition='outside')
    fig.show()
Aspect severity ranking (lowest avg rating = most damaging):
severity_rank  aspect            avg_rating  median_rating  complaint_count  label
1              Staff/Service     1.64        1.0            379              1.64 (n=379)
2              Connections       1.85        1.0            285              1.85 (n=285)
3              Safety            2.02        2.0            573              2.02 (n=573)
4              Cleanliness       2.37        2.0            319              2.37 (n=319)
5              Toilets           2.50        2.0            339              2.50 (n=339)
6              Seating/Waiting   2.55        3.0            194              2.55 (n=194)
7              Accessibility     2.58        3.0            149              2.58 (n=149)
8              Food & Shops      2.67        3.0            239              2.67 (n=239)
9              Lifts/Escalators  2.68        3.0            57               2.68 (n=57)
10             Signage/Nav       2.75        3.0            346              2.75 (n=346)
11             Crowds            2.76        3.0            439              2.76 (n=439)
12             Parking/Bikes     2.86        3.0            87               2.86 (n=87)
(a) Average review rating per negative aspect: lowest = most damaging.
(b)
Figure 20

Negative Staff/Service and Connections complaints are the most severe, averaging just 1.6 and 1.9 stars respectively. These aspects appear in reviews where travellers are most frustrated, suggesting that poor staff interactions and missed or delayed connections trigger harsher ratings than other issues. Safety ranks third (2.0 stars avg) with the highest complaint volume (573 mentions), making it both severe and widespread.

In contrast, aspects like Parking/Bikes, Crowds, and Signage/Navigation average closer to 3 stars, meaning reviewers who complain about these still rate the station more moderately. This suggests that while these are common pain points, they are less likely to drive a 1-star review on their own.

The median rating reinforces the pattern: Staff/Service and Connections have a median of 1.0 stars, Safety, Cleanliness, and Toilets a median of 2.0, and all remaining aspects a median of 3.0.

Volume vs Severity in One View

The two previous charts show volume and severity separately, but the actionable view is both at once. Aspects in the bottom-right of the scatter are both widely complained about and dragging ratings down the most.

Show code
# Volume vs severity scatter
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()
    vs = (
        neg.groupby('aspect')
        .agg(volume=('rating', 'count'), severity=('rating', 'mean'))
        .reset_index()
    )
    vs = vs[vs['volume'] >= MIN_COMPLAINTS_PER_ASPECT]

    fig = px.scatter(
        vs, x='volume', y='severity', text='aspect', size='volume', color='aspect',
        color_discrete_map=ASPECT_COLORS,
        title='Aspect Volume vs Severity (negative mentions, lower y = harsher ratings)',
        labels={'volume': 'Number of negative mentions', 'severity': 'Avg review rating'},
        size_max=55,
    )
    fig.update_traces(textposition='top center')
    fig.update_layout(showlegend=False, height=500, yaxis_range=[1, 4])
    fig.show()
Figure 21: Volume vs severity per aspect. Bottom-right = widespread and severe.
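The "bottom-right" reading can also be made explicit in code. The sketch below re-enters the volume and average-rating values from the severity table and flags aspects that sit at or above the median volume and at or below the median rating. Splitting at the medians is an assumption made here for illustration, and other cut-offs would be equally defensible.

```python
import pandas as pd

# Values copied from the aspect severity table above
severity = pd.DataFrame({
    'aspect': ['Staff/Service', 'Connections', 'Safety', 'Cleanliness', 'Toilets',
               'Seating/Waiting', 'Accessibility', 'Food & Shops', 'Lifts/Escalators',
               'Signage/Nav', 'Crowds', 'Parking/Bikes'],
    'volume': [379, 285, 573, 319, 339, 194, 149, 239, 57, 346, 439, 87],
    'avg_rating': [1.64, 1.85, 2.02, 2.37, 2.50, 2.55, 2.58, 2.67, 2.68, 2.75, 2.76, 2.86],
})

vol_cut = severity['volume'].median()       # split between low and high volume
sev_cut = severity['avg_rating'].median()   # split between harsh and mild ratings

# "Bottom-right" = many negative mentions AND low average rating
hot = severity[(severity['volume'] >= vol_cut) & (severity['avg_rating'] <= sev_cut)]
hot_aspects = hot.sort_values('avg_rating')['aspect'].tolist()
print(hot_aspects)
```

With a median split, Staff/Service, Safety, Cleanliness, and Toilets land in the widespread-and-severe quadrant. Connections is severe but falls just below the volume cut, and Crowds and Signage/Nav are frequent but attached to milder ratings.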

The previous charts show which complaint aspects are most common and most severe across the entire network, but they do not tell me where the problems are concentrated. The next cell breaks complaints down by station to answer: Which stations have the most complaints, and what are they about?

The table shows each station’s dominant complaint topic (the aspect with the most negative mentions), sorted by overall rating. The heatmap then visualises the full complaint profile of the 20 worst-rated stations, normalised to percentages so that stations with different review volumes are directly comparable. This reveals whether a station’s complaints are concentrated in one area or spread across many.

Tip: What travellers complain about, at a glance

Safety, Crowds, Staff/Service, Signage/Nav, and Toilets dominate the negative reviews. Staff/Service complaints are the most damaging (avg 1.6 stars), Safety the most widespread (573 mentions).

Show code
# Per-station complaint summary (AI)
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    # Aspect-level filter (consistent with the section-wide threshold)
    aspect_totals = neg['aspect'].value_counts()
    keep_aspects = aspect_totals[aspect_totals >= MIN_COMPLAINTS_PER_ASPECT].index
    neg = neg[neg['aspect'].isin(keep_aspects)]

    # Number of unique reviews analysed per station (denominator for %)
    reviews_analysed = (
        ai.drop_duplicates(['station_name', 'date_estimated', 'text'])
        .groupby('station_name')
        .size()
        .rename('reviews_analysed')
    )

    station_aspect = (
        neg.groupby(['station_name', 'aspect'])
        .size()
        .reset_index(name='complaints')
    )

    # Attach station rating + total reviews analysed
    stn_ratings = done.drop_duplicates('name')[['name', 'overall_rating']]
    station_aspect = (
        station_aspect
        .merge(stn_ratings, left_on='station_name', right_on='name', how='left')
        .drop(columns='name')
        .merge(reviews_analysed, on='station_name', how='left')
    )

    # Dominant complaint per station, sorted by rating ascending (worst first)
    dominant = (
        station_aspect
        .sort_values('complaints', ascending=False)
        .drop_duplicates(subset='station_name')
        .sort_values('overall_rating', ascending=True)
        .reset_index(drop=True)
    )
    dominant['pct_mentioning'] = (dominant['complaints'] / dominant['reviews_analysed'] * 100).round(1)
    dominant.index += 1

    table = dominant[['station_name', 'overall_rating', 'aspect', 'complaints', 'reviews_analysed', 'pct_mentioning']].copy()
    table.columns = ['Station', 'Rating', 'Top complaint', 'Mentions', 'Reviews analysed', '% mentioning of analysed station reviews']

    print('Dominant complaint topic per station (AI), sorted by rating:')
    with pd.option_context('display.max_rows', None):
        display(table)

    # Heatmap: station × aspect complaint profile (top 20 worst-rated only)
    top20 = dominant.head(20)['station_name'].tolist()
    pivot = (
        station_aspect[station_aspect['station_name'].isin(top20)]
        .pivot(index='station_name', columns='aspect', values='complaints')
        .fillna(0)
        .astype(int)
    )
    # Sort heatmap rows by rating ascending
    rating_order = dominant[dominant['station_name'].isin(top20)].set_index('station_name')['overall_rating']
    pivot = pivot.loc[rating_order.sort_values().index]

    # Normalize to percentages (row-wise)
    pivot = pivot.div(pivot.sum(axis=1), axis=0) * 100

    fig = px.imshow(
        pivot,
        color_continuous_scale='Reds',
        title='Complaint profile: Top 20 worst-rated stations (% of complaints per aspect)',
        labels={'color': '% of complaints'},
        aspect='auto',
        height=600,
        text_auto='.0f',
    )
    fig.show()
Dominant complaint topic per station (AI), sorted by rating:
Station Rating Top complaint Mentions Reviews analysed % mentioning of analysed station reviews
1 Yverdon-les-Bains 3.3 Safety 114 261 43.7
2 Genève-Aéroport 3.6 Crowds 29 167 17.4
3 Lenzburg 3.7 Crowds 20 71 28.2
4 Langenthal 3.7 Safety 6 27 22.2
5 Aarau 3.8 Safety 32 137 23.4
6 Bern Wankdorf 3.8 Accessibility 4 24 16.7
7 Bern Europaplatz 3.8 Signage/Nav 2 5 40.0
8 Glattbrugg 3.8 Cleanliness 3 12 25.0
9 Olten 3.8 Safety 17 127 13.4
10 Wil SG 3.8 Safety 14 53 26.4
11 Uster 3.9 Cleanliness 8 96 8.3
12 Oensingen 3.9 Toilets 1 4 25.0
13 Solothurn 3.9 Safety 26 131 19.8
14 Zürich Hardbrücke 3.9 Cleanliness 11 81 13.6
15 Brugg AG 3.9 Cleanliness 8 54 14.8
16 Lausanne 4.0 Connections 29 437 6.6
17 Rotkreuz 4.1 Toilets 2 23 8.7
18 Sion 4.1 Safety 9 45 20.0
19 Kreuzlingen 4.1 Cleanliness 3 19 15.8
20 Fribourg/Freiburg 4.1 Cleanliness 10 88 11.4
21 Vevey 4.1 Safety 11 39 28.2
22 Bülach 4.1 Parking/Bikes 2 14 14.3
23 Baden 4.1 Signage/Nav 12 257 4.7
24 Biel/Bienne 4.1 Safety 13 92 14.1
25 Visp 4.1 Safety 14 78 17.9
26 Basel SBB 4.1 Toilets 51 776 6.6
27 La Chaux-de-Fonds 4.1 Crowds 4 22 18.2
28 Frauenfeld 4.1 Safety 4 25 16.0
29 Landquart 4.1 Toilets 1 23 4.3
30 Burgdorf 4.2 Connections 3 12 25.0
31 Zürich Enge 4.2 Cleanliness 5 41 12.2
32 Zürich Oerlikon 4.2 Staff/Service 36 617 5.8
33 Zürich Stadelhofen 4.2 Crowds 15 181 8.3
34 Delémont 4.2 Staff/Service 2 14 14.3
35 Genève 4.2 Toilets 25 327 7.6
36 Winterthur 4.2 Crowds 13 147 8.8
37 Sargans 4.2 Cleanliness 4 47 8.5
38 Opfikon 4.3 Toilets 2 4 50.0
39 Martigny 4.3 Crowds 4 23 17.4
40 Bellinzona 4.3 Crowds 5 70 7.1
41 Kreuzlingen Hafen 4.3 Cleanliness 1 12 8.3
42 Neuchâtel 4.3 Parking/Bikes 4 35 11.4
43 Chur 4.3 Safety 18 113 15.9
44 Zug 4.3 Toilets 8 89 9.0
45 Schaffhausen 4.3 Safety 10 60 16.7
46 St. Gallen 4.3 Toilets 11 104 10.6
47 Luzern 4.4 Crowds 55 1374 4.0
48 Zürich HB 4.4 Signage/Nav 101 1618 6.2
49 Brig 4.4 Toilets 4 40 10.0
50 Lugano 4.4 Staff/Service 13 189 6.9
51 Arth-Goldau 4.4 Toilets 3 73 4.1
52 Thun 4.5 Safety 5 61 8.2
53 Nyon 4.5 Staff/Service 3 30 10.0
54 Bern 4.5 Crowds 20 198 10.1
55 Rorschach 4.5 Food & Shops 1 11 9.1
56 Locarno 4.5 Connections 5 24 20.8
57 Montreux 4.5 Staff/Service 4 49 8.2
58 Rapperswil SG 4.6 Staff/Service 2 33 6.1
59 Genève-Eaux-Vives 4.6 Seating/Waiting 1 14 7.1
60 Rorschach Stadt 4.6 Connections 1 3 33.3
61 Rorschach Hafen 4.7 Staff/Service 2 12 16.7
(a) Per-station complaint profile: the 20 worst-rated stations × aspect, normalised to percentages so stations with different review volumes are comparable.
(b)
Figure 22

Keep in mind that not all stations have enough negative reviews to derive actionable items. For example, Oensingen, Bern Wankdorf, and Bern Europaplatz each have fewer than 5 negative mentions for their most-mentioned aspect, so further analysis is needed before concrete improvements can be derived for these stations.
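How little a handful of mentions says can be quantified with a confidence interval on the share. The sketch below uses the standard Wilson score interval for a proportion. Treating "mentions out of reviews analysed" as a simple binomial proportion is an assumption made here, and the helper `wilson_interval` is not part of the project code.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials (z=1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Mentions and reviews analysed, taken from the per-station table above
for station, k, n in [('Bern Europaplatz', 2, 5), ('Yverdon-les-Bains', 114, 261)]:
    lo, hi = wilson_interval(k, n)
    print(f'{station}: {k}/{n} = {k / n:.1%}, approx. 95% interval {lo:.1%} to {hi:.1%}')
```

The two stations show similar headline percentages, but the interval for the station with only five analysed reviews spans most of the range from near zero to well above half, whereas the interval for Yverdon-les-Bains is comparatively narrow.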

7: From Insights to Action

The complaint data above is rich but does not yet point to concrete actions. Two further scripts synthesise it into actionable outputs:

  • sbb-reviews summarize-complaints identifies the 20 worst-rated stations and ranks their (station, aspect) complaint groups by a priority score. The score combines how much each aspect contributes to a station’s complaints, how severe those complaints are, how recent they are, and how poorly the station is rated overall, so the items at the top are concentrated, harsh, fresh, and at stations already known to be problematic. The full formula and the role of each term are spelled out just before the chart at the end of the section.

  • sbb-reviews summarize-strengths profiles the top 20 best-rated stations, surfaces the aspects they consistently excel at, and runs a gap analysis against the worst stations.

Run sbb-reviews summarize-complaints and sbb-reviews summarize-strengths to generate the CSVs below.

Below I first look at what top stations get praised for, then contrast that with what worst stations get complained about, and finally turn that contrast into a concrete list of priority action items grouped by expected return on investment.

What the Best Stations Do Right

Among the 20 best-rated stations, only six aspects qualify as genuine strengths (≥ 5% of a station’s positive reviews and ≥ 5 mentions). The chart sorts them by how many stations share each strength, which serves as a proxy for transferability.

Show code
# Cross-station strength frequency (best-rated stations)
_highlights_csv = pathlib.Path("../data/derived/station_highlights.csv")
if not _highlights_csv.exists():
    print("station_highlights.csv not found -- run `sbb-reviews summarize-strengths` first.")
else:
    highlights = pd.read_csv(_highlights_csv)

    freq = (
        highlights.groupby("aspect")
        .agg(stations=("station_name", "nunique"), avg_pct=("pct_of_station", "mean"))
        .sort_values("stations", ascending=False)
        .reset_index()
    )
    freq["avg_pct"] = freq["avg_pct"].round(1)

    fig = px.bar(
        freq.sort_values("stations"),
        x="stations", y="aspect",
        orientation="h",
        color="aspect", color_discrete_map=ASPECT_COLORS,
        text="stations",
        title="Transferable Strengths: Aspects Shared by Top-20 SBB Stations",
        labels={"stations": "Number of top-20 stations praised for this aspect", "aspect": ""},
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(showlegend=False, height=400)
    fig.show()
Figure 23: Number of top-20 best-rated stations that excel at each aspect.

Overall, Connections and Food & Shops are the aspects most often mentioned in positive reviews, closely followed by Cleanliness and Staff/Service.
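The qualification rule for a strength (at least 5% of a station’s positive reviews and at least 5 mentions) can be sketched as a simple filter. This is an illustration under assumptions, not the code behind `sbb-reviews summarize-strengths`: the input columns `mentions` and `positive_reviews` are hypothetical names, and the demo numbers are made up. Only `pct_of_station` matches a column that appears in station_highlights.csv.

```python
import pandas as pd

MIN_SHARE_PCT = 5.0  # at least 5% of a station's positive reviews
MIN_MENTIONS = 5     # at least 5 positive mentions of the aspect


def qualifying_strengths(positive_counts: pd.DataFrame) -> pd.DataFrame:
    """Keep the (station, aspect) rows that clear both thresholds.

    Expects columns: station_name, aspect, mentions, positive_reviews
    (hypothetical names chosen for this sketch).
    """
    df = positive_counts.copy()
    df["pct_of_station"] = df["mentions"] / df["positive_reviews"] * 100
    keep = (df["pct_of_station"] >= MIN_SHARE_PCT) & (df["mentions"] >= MIN_MENTIONS)
    return df[keep].reset_index(drop=True)


# Made-up numbers, purely to exercise the filter
demo = pd.DataFrame({
    "station_name": ["A", "A", "B", "B"],
    "aspect": ["Connections", "Toilets", "Connections", "Cleanliness"],
    "mentions": [12, 3, 6, 4],
    "positive_reviews": [100, 100, 200, 40],
})
strengths = qualifying_strengths(demo)
print(strengths)
```

Requiring both conditions matters: a 10% share built on four mentions is dropped, and so is a six-mention aspect that makes up only 3% of a large station’s praise.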

Visible Wins vs Damage Control

Comparing what top-20 stations get praised for against what bottom-20 stations get complained about reveals two fundamentally different types of investment:

Category | Meaning
Visible win | Top stations get praised for this aspect. Investment has a visible payoff in positive sentiment.
Damage control | Even top stations do not get praised here. Fixing it stops complaints but will not generate praise. Travellers expect it to “just work”.

Both categories require investment. The difference is in the expected return.

Show code
# Gap analysis -- visible wins vs damage control
_gap_csv = pathlib.Path("../data/derived/strength_gaps.csv")
if not _gap_csv.exists():
    print("strength_gaps.csv not found -- run `sbb-reviews summarize-strengths` first.")
else:
    gaps = pd.read_csv(_gap_csv)
    gaps["display_category"] = gaps["category"].map({
        "learnable": "visible_win",
        "systemic_gap": "damage_control",
        "minor": "minor",
    })

    category_colors = {"damage_control": "#d32f2f", "visible_win": "#388e3c", "minor": "#9e9e9e"}

    fig = px.bar(
        gaps[gaps["display_category"] != "minor"].sort_values("complaint_rows", ascending=False),
        x="aspect",
        y="complaint_rows",
        color="display_category",
        color_discrete_map=category_colors,
        text="strength_stations",
        labels={
            "complaint_rows": "Complaint rows (worst stations)",
            "aspect": "Aspect",
            "display_category": "Category",
            "strength_stations": "Top stations excelling",
        },
        title="Visible Wins vs Damage Control by Aspect",
    )
    fig.update_traces(texttemplate="%{text} top stations", textposition="outside")
    fig.update_layout(height=420, xaxis_tickangle=-20)
    fig.show()
Figure 24: Aspect-level gap analysis: visible wins vs damage control.

All aspects on this chart require investment. Food & Shops needs retail buildout, Connections needs platforms and scheduling work, Cleanliness needs staff and maintenance. The distinction is not cost itself. It is whether top stations get praised for the aspect, which signals whether spending shows up as positive sentiment or just as the absence of complaints.

In this dataset, Safety, Crowds, and Toilets fall on the damage-control side, while Cleanliness, Connections, Food & Shops, Staff/Service, and Signage/Nav are visible wins.

Connections deserves a footnote: it appears as a visible win because top transfer hubs are praised for being well-connected, but the core levers (frequency, scheduling) depend on network planning, not station-level effort. Local levers (clear transfer signage, real-time departure info, staff assistance for missed connections) are still worth pursuing.
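The split described above boils down to one question per aspect: does any top station get praised for it? A minimal sketch of that decision follows. The `strength_stations` and `complaint_rows` names match columns read from strength_gaps.csv in the code above, but the rule itself, and in particular the `min_complaint_rows` cut-off for "minor", is my assumption and may differ from what `sbb-reviews summarize-strengths` does. The counts are invented.

```python
import pandas as pd


def categorise_aspect(strength_stations: int, complaint_rows: int,
                      min_complaint_rows: int = 3) -> str:
    """Label an aspect by its expected return on investment.

    min_complaint_rows is a hypothetical cut-off for ignoring aspects
    that are rarely criticised at the worst stations.
    """
    if complaint_rows < min_complaint_rows:
        return "minor"
    if strength_stations > 0:
        # At least one top station is praised for it: spending can earn praise
        return "visible_win"
    # Nobody is praised for it: spending only removes complaints
    return "damage_control"


# Invented counts, for illustration only
demo = pd.DataFrame({
    "aspect": ["Safety", "Cleanliness", "Seating/Waiting"],
    "strength_stations": [0, 7, 0],
    "complaint_rows": [12, 9, 1],
})
demo["display_category"] = [
    categorise_aspect(s, c)
    for s, c in zip(demo["strength_stations"], demo["complaint_rows"])
]
print(demo)
```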

Tip: Action priorities at a glance

Worst-rated stations share systemic Safety, Crowds, and Toilet issues that need infrastructure investment. Cleanliness, Connections, Staff/Service, and Food & Shops are visible wins where top stations show a clear playbook to copy.

Derivation for the Top Improvements to Ensure Customer Satisfaction

Each (station, aspect) complaint group is scored to surface the most urgent items. The priority score is:

priority = pct_of_station × severity_weight × recency_weight × (1 + station_weight)

where:

  • pct_of_station is the share of a station’s negative complaints that this aspect accounts for. Using a percentage instead of a raw count makes the score station-size-neutral: an aspect that drives half the complaints at a small station counts the same as one that drives half at a big station.
  • severity_weight = (6 - avg_rating) / 5 captures how harshly travellers rated reviews that mentioned this aspect. A 1-star average gives a weight of 1.0. A 5-star average gives 0.2.
  • recency_weight multiplies by how fresh the complaints are, based on the most recent complaint date in the group. Active issues (last 90 days) get a 2.0× boost. Stale ones (no complaints in 2+ years) get 0.2×.
  • station_weight = 5 - station_overall_rating lifts the score for stations whose overall rating is already dragging the network down, so two equally severe complaints get prioritised at the worse-rated station.
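To make the formula concrete, here is a small sketch that computes the score for one complaint group. The function names and the example inputs are mine, not taken from the pipeline. Because the text only gives the two extreme recency tiers (2.0× and 0.2×), the sketch takes `recency_weight` as a ready-made input instead of guessing the tiers in between. `pct_of_station` is on a 0-100 scale, as in the rendered action items.

```python
def severity_weight(avg_rating: float) -> float:
    """Harsher reviews weigh more: 1 star -> 1.0, 5 stars -> 0.2."""
    return (6 - avg_rating) / 5


def station_weight(station_overall_rating: float) -> float:
    """Lower-rated stations get a larger lift."""
    return 5 - station_overall_rating


def priority_score(pct_of_station: float, avg_rating: float,
                   recency_weight: float, station_overall_rating: float) -> float:
    """priority = pct_of_station x severity x recency x (1 + station_weight)."""
    return (
        pct_of_station
        * severity_weight(avg_rating)
        * recency_weight
        * (1 + station_weight(station_overall_rating))
    )


# Hypothetical group: 40% of a 3.8-star station's complaints,
# 2.0-star average on those reviews, last complaint within 90 days.
score = priority_score(
    pct_of_station=40.0,
    avg_rating=2.0,
    recency_weight=2.0,
    station_overall_rating=3.8,
)
print(round(score, 1))  # 40 x 0.8 x 2.0 x 2.2 = 140.8
```

Since every term is multiplicative, a stale group (0.2×) scores one tenth of an otherwise identical active group (2.0×), so recency can outweigh all other factors.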

The chart and the two lists below group each action item by whether the aspect is a visible win or damage control, as defined earlier in this section. The items shown are the top recommendations according to this priority score. The summary text for each item was generated by AI.

Show code
# Priority action items (worst-rated stations, split by upside profile)
from IPython.display import Markdown, display

_action_csv = pathlib.Path("../data/derived/action_items.csv")
_gaps_csv = pathlib.Path("../data/derived/strength_gaps.csv")

if not _action_csv.exists() or not _gaps_csv.exists():
    print("action_items.csv or strength_gaps.csv not found. Run `sbb-reviews summarize-complaints` and `sbb-reviews summarize-strengths` first.")
else:
    actions = pd.read_csv(_action_csv)
    aspect_cat = pd.read_csv(_gaps_csv)[["aspect", "category"]]
    actions = actions.merge(aspect_cat, on="aspect", how="left")
    actions["bucket"] = actions["category"].map({
        "learnable": "Visible Win",
        "systemic_gap": "Damage Control",
    }).fillna("Other")

    def render_items(df, heading, subtitle):
        lines = [f"### {heading}", f"_{subtitle}_", ""]
        for i, row in enumerate(df.itertuples(index=False), 1):
            lines.append(
                f"**{i}. {row.station_name}** ({row.station_overall_rating:.1f}★) — "
                f"*{row.aspect}* — {row.complaint_count} complaints, "
                f"{row.pct_of_station:.0f}% of station's complaints, priority score {row.priority_score:.0f}"
            )
            lines.append("")
            lines.append(f"> {row.summary}")
            lines.append("")
        return Markdown("\n".join(lines))

    visible_wins = actions[actions["bucket"] == "Visible Win"].head(10)
    hygiene_fixes = actions[actions["bucket"] == "Damage Control"].head(10)

    print(f"Total action items: {len(actions)}  |  Stations: {actions['station_name'].nunique()}")
    display(render_items(
        hygiene_fixes,
        "Damage control: top 10",
        "No top station excels here; payoff is complaint reduction.",
    ))
    display(render_items(
        visible_wins,
        "Visible Wins: top 10",
        "Top stations validate this aspect; payoff is positive sentiment.",
    ))

    # Combined priority chart, colored by bucket
    plot_df = actions.head(20).copy()
    plot_df["label"] = plot_df["station_name"] + ": " + plot_df["aspect"]
    fig = px.bar(
        plot_df,
        x="priority_score",
        y="label",
        color="bucket",
        orientation="h",
        # Keys must match the `bucket` values exactly (case-sensitive)
        color_discrete_map={
            "Visible Win": "#388e3c",
            "Damage Control": "#d32f2f",
            "Other": "#9e9e9e",
        },
        labels={"priority_score": "Priority Score", "label": "", "bucket": "Category"},
        title="Top 20 Priority Action Items: Total Priority Scores Unified: Visible Wins vs Damage Control",
    )
    fig.update_layout(yaxis={"categoryorder": "total ascending"}, height=550, margin={"l": 280})
    fig.show()
Total action items: 56  |  Stations: 16

Damage control: top 10

No top station excels here; payoff is complaint reduction.

1. Yverdon-les-Bains (3.3★) — Safety — 114 complaints, 63% of station’s complaints, priority score 136

Complaints describe Yverdon-les-Bains as unsafe due to persistent drug dealing, addicts, beggars, drunk individuals, and people harassing others around the station. Reviewers also mention weed smoke, fights at night, a gloomy atmosphere, and police presence seen as frequent but ineffective.

2. Solothurn (3.9★) — Safety — 26 complaints, 38% of station’s complaints, priority score 131

Safety complaints at Solothurn center on an unsettling atmosphere created by drug users, dealers, alcoholics, beggars, and other suspicious individuals, especially in the evening and after dark. Several reviews also mention aggressive solicitation and describe the station as functioning like a public drug scene.

3. Lenzburg (3.7★) — Crowds — 20 complaints, 33% of station’s complaints, priority score 124

Crowding complaints at Lenzburg consistently say the station is too small for passenger volumes, especially at peak times. Narrow platforms, too few underpasses, and congested exits create overcrowding, navigation difficulties, and a sense of danger or panic.

4. Aarau (3.8★) — Safety — 32 complaints, 38% of station’s complaints, priority score 123

Aarau is described as feeling unsafe because of loitering intoxicated groups, aggressive or unsavory individuals, and chaotic conditions around the entrance and outside the station. Complaints highlight nighttime danger, weekend fights, smoking and noise, and a general sense that authorities are not keeping the area under control.

5. Wil SG (3.8★) — Safety — 14 complaints, 40% of station’s complaints, priority score 117

Complaints about Wil SG focus on a persistently unsafe atmosphere linked to drug users, alcoholics, beggars, and other undesirable individuals around the station and bus area. Reviewers particularly mention poor lighting, weak security or police presence, nighttime risk for women, and even reports of assaults and violent incidents.

6. Genève-Aéroport (3.6★) — Crowds — 29 complaints, 24% of station’s complaints, priority score 98

Complaints about Genève-Aéroport focus on severe overcrowding and queue management problems, especially long waits at passport control, security, and luggage collection. Reviewers describe the airport as chaotic, slow, claustrophobic, and poorly organized, with too few checkpoints open and added inconvenience when facilities are closed.

7. Olten (3.8★) — Safety — 17 complaints, 24% of station’s complaints, priority score 81

Complaints about Olten highlight strong personal safety concerns linked to drug use, drunk or aggressive individuals, beggars, and threatening encounters. Reviewers particularly describe the station and surrounding area as uncomfortable or unsafe at night, especially for women, with reports of assaults, intimidation, and broader crime concerns.

8. Zürich Hardbrücke (3.9★) — Safety — 7 complaints, 13% of station’s complaints, priority score 48

Complaints about Zürich Hardbrücke center on feeling unsafe, especially at night, with reports of being followed, attacked with objects, and discomfort around loitering youth. Reviewers also mention dark surroundings, smoking concerns, and dangerous bike-pedestrian traffic conflicts as contributing to the station’s unsafe atmosphere.

9. Basel SBB (4.1★) — Toilets — 51 complaints, 15% of station’s complaints, priority score 41

Basel SBB toilet complaints mainly concern having to pay high fees for the station’s toilets, often cash-only or coin-operated, with broken payment/access systems that can leave passengers unable to use them. Many also describe the facilities as dirty, outdated, poorly maintained, and insufficiently available, including long waits and missing toilets in some areas.

10. Zürich Hardbrücke (3.9★) — Crowds — 7 complaints, 13% of station’s complaints, priority score 39

Zürich Hardbrücke is mainly criticized for overcrowding, especially at peak times, where heavy commuter traffic reduces comfort. Reviews also mention a dark, chaotic atmosphere during construction, evening crowds involving smokers and intoxicated or drug-using youths, and generally stressful circulation through the station.

(a) Top 20 priority action items, coloured by category.

Visible Wins: top 10

Top stations validate this aspect; payoff is positive sentiment.

1. Brugg AG (3.9★) — Cleanliness — 8 complaints, 24% of station’s complaints, priority score 82

Complaints about Brugg AG center on poor cleanliness and neglect, with repeated mentions of a dirty station, dirty platforms, and a general rundown feel. Reviewers also highlight persistent unpleasant odors, including urine smells, and filthy areas where people drink alcohol.

2. Zürich Hardbrücke (3.9★) — Cleanliness — 11 complaints, 20% of station’s complaints, priority score 72

Complaints about Zürich Hardbrücke focus on poor cleanliness, with repeated descriptions of dirty platforms and generally dirty station conditions. Bad smells, especially urine and smoke, along with an ugly, uncomfortable environment and broken lifts, reinforce the sense of neglect.

3. Lausanne (4.0★) — Connections — 29 complaints, 14% of station’s complaints, priority score 46

Complaints about connections at Lausanne center on chronic delays, cancellations, and trains rarely running on time, often causing missed appointments. Passengers also report last-minute platform or route changes, connections not waiting, disruption from ongoing construction and limited platform capacity, and poor or absent information when problems occur.

4. Basel SBB (4.1★) — Staff/Service — 41 complaints, 12% of station’s complaints, priority score 44

Basel SBB staff/service complaints are dominated by rude, dismissive, and unhelpful staff at ticket, information, and service counters, including poor support in English and with ticketing or refund issues. Reviews also mention unfair fines, long queues and waiting times, and lack of customer service during holidays.

5. Zürich Hardbrücke (3.9★) — Connections — 7 complaints, 13% of station’s complaints, priority score 44

At Zürich Hardbrücke, connection complaints focus on frequent delays, especially on the S6 to Baden, linked to bottlenecks, high train density, and limited track capacity. Passengers also mention confusing platform layouts, difficult transfers between bus stops, overcrowded trains, and a lack of direct service.

6. Genève-Aéroport (3.6★) — Signage/Nav — 18 complaints, 15% of station’s complaints, priority score 42

Genève-Aéroport signage/navigation complaints point to unclear or missing signage throughout the station-airport interface, including check-in desks, terminals, buses, train departures, and security. Passengers describe the layout as disorganized and confusing, with poor communication of platform changes and too little accessible information for timetables and other services.

7. Genève-Aéroport (3.6★) — Staff/Service — 22 complaints, 18% of station’s complaints, priority score 39

Complaints at Genève-Aéroport focus on poor staff service: rude or dismissive behavior, little help with ticket changes and reimbursements, and weak support for non-French speakers. Travelers also report missing or understaffed service points, badly managed passport/check-in queues, slow security and manual processes, plus limited Wi‑Fi and poor late-evening service availability.

8. Olten (3.8★) — Food & Shops — 8 complaints, 11% of station’s complaints, priority score 38

At Olten, the main issue is a weak retail and food offer, with few shops, mostly takeaway options, and no proper restaurant. Complaints also repeatedly highlight missing ATMs for Swiss francs, inconvenient shop locations, and prices seen as too high for the limited choice.

9. Genève-Aéroport (3.6★) — Food & Shops — 15 complaints, 12% of station’s complaints, priority score 38

At Genève-Aéroport, food and retail complaints center on very high prices paired with too little choice, including expensive drinks, poor-quality dining, and few shops or stalls open. Travelers also criticize early closing times, long queues when outlets open late, and weak amenities such as confusing locker policies and lack of charging points.

10. Biel/Bienne (4.1★) — Staff/Service — 7 complaints, 14% of station’s complaints, priority score 37

Complaints center on consistently poor staff interactions at Biel/Bienne, with employees described as aggressive, arrogant, unfriendly, and not sufficiently helpful in English. Service frustrations are worsened by cancellations and the lack of staffed ticket counters after 8 PM, leaving passengers without support.

Figure 25

8: Takeaways & Limitations

What I found out

  • Overall sentiment is positive. The network averages 4.16 stars, yet the negative tail concentrates around a handful of stations where issues are systemic rather than scattered.
  • Reviewers are harshest in their home region. Local reviewers consistently rate their own region’s stations lower than out-of-region reviewers do. Tourists give the most generous ratings overall.
  • Ratings drift over time, but slowly. Most stations are stable across 2022-2025. A few (for example, Wil SG, Oensingen, Bülach) gained 0.6-0.8 stars, suggesting recent improvements were noticed.
  • Two strategic categories of complaint emerged. Safety, Crowds, and Toilets are damage control: high complaint volume but no praise even at top stations, so investment stops the bleeding without lifting sentiment. Cleanliness, Connections, Food & Shops, Staff/Service, and Signage/Nav are visible wins: top stations show that positive sentiment is achievable.
  • Staff/Service, Connections, and Safety complaints sit on the harshest reviews (average ratings 1.6, 1.9, 2.0 stars, well below the other aspects). Combined with how often each appears, Safety, Staff/Service, and Crowds dominate the negative-rating story: Crowds drops fewer star points per complaint but shows up in very high volume, while Connections is severe but less frequent.

What this analysis cannot tell me

  • No causal claims. Correlation between aspect and rating is not proof of cause. A station may be rated low for reasons not captured in its review text at all.
  • Coverage is partial. Roughly 2-3% of reviews were lost to AI batch truncation. In addition, scraped review texts are cut off at about 340 characters, so long, nuanced reviews are clipped before the model ever sees them. Romansh reviewers are absent entirely.
  • Self-selection bias. Google Maps reviewers are not a representative sample of travellers. The people who bother writing a review are usually the ones with strong feelings, either very satisfied or very upset. The everyday commuter who finds the station “fine” rarely shows up in the data, so the dataset over-represents extremes in both directions.
  • Volume in reviews is not a measure of importance. Some aspects show up less often because the affected population is smaller, not because the issue matters less. Accessibility is the clearest example: wheelchair users, parents with strollers, and travellers with mobility constraints make up a small share of reviewers, so the aspect ranks low on volume and falls out of the priority charts. That does not make it any less critical to the people for whom a missing lift or an inaccessible platform is the difference between using the station and not using it. Review-derived priorities are a useful complement to, not a substitute for, dedicated accessibility audits and SBB’s existing standards.
  • No within-station granularity. A complaint at Zürich HB does not tell me whether the issue was on a platform, in the underground retail level, or at a ticket counter. The whole station is treated as one unit, which also makes cross-station comparison tricky: “crowds” at a 100,000-passenger-per-day hub means something very different from “crowds” at a small regional station, but in the dataset they look identical.
  • Snapshot, not stream. This is a one-shot analysis that only captures the data as of the date the reviews were scraped. The real value of such an analysis would come from running it continuously on live data.

What I would do next

  • Connect the scraper and the AI classifier into a scheduled job, and add more data sources such as social media and travel forums, so sentiment can be tracked monthly per station.
  • Add a comparison baseline: rate other major European stations the same way to see whether SBB’s 4.16 is genuinely high or just typical for the segment.
  • Cross-reference complaint timelines against operational events (renovation projects, schedule changes) to test whether interventions show up in sentiment data.