目录
一、整体架构理解
1. 数据分层架构
2. 事务事实表特点
二、各表详细解析
1. 工具域优惠券使用事务事实表
业务场景深度解析
数据流转细节
装载逻辑详解
典型分析场景
2. 互动域收藏商品事务事实表
业务场景深度解析
数据装载技术细节
数据分析应用
3. 流量域页面浏览事务事实表
业务场景深度解析
字段详细解释
复杂技术点解析
流量分析实战
4. 用户域用户注册事务事实表
业务场景深度解析
表关联逻辑深度解析
注册分析实战
5. 用户域用户登录事务事实表
业务场景深度解析
技术实现深度解析
登录行为分析实战
三、数据治理与质量控制
1. 数据一致性检查
2. 数据血缘分析
3. 性能优化建议
四、业务应用场景总结
1. 运营分析
2. 用户分析
3. 产品分析
4. 商业分析
五、常见问题与解决方案
1. 数据延迟问题
2. 数据质量监控
3. 数据回溯处理
六、完整代码
一、整体架构理解
1. 数据分层架构
原始数据 → ODS层(原始数据层) → **DWD层(数据明细层)** → DWS层(数据服务层) → ADS层(应用数据层)
2. 事务事实表特点
粒度:最细粒度业务事件
维度:包含丰富的上下文维度信息
度量:主要是计数型指标
更新策略:首日全量 + 每日增量
二、各表详细解析
1. 工具域优惠券使用事务事实表
业务场景深度解析
使用时机:用户下单支付时核销优惠券
前置状态:优惠券已领取但未使用
触发事件:订单支付成功
业务价值:
计算优惠券核销率
分析优惠券对订单转化率的影响
评估营销活动ROI
数据流转细节
-- ODS层数据结构示例 ods_coupon_use_inc { type: 'bootstrap-insert' 或 'update', data: { id: '123', coupon_id: 'coupon_001', user_id: 'user_1001', order_id: 'order_20220608001', used_time: '2022-06-08 10:30:45' -- 关键字段,不为空表示已使用 }, old: {used_time: null} -- 更新前值,用于判断状态变更 }装载逻辑详解
-- 首日装载:处理历史全量数据 -- 筛选条件:used_time不为空的记录 -- 这表示优惠券已经被使用(核销) -- 每日增量装载:处理当日状态变更 -- array_contains(map_keys(old),'used_time') 解释: -- 1. map_keys(old):获取old字典的所有键 -- 2. array_contains:检查数组中是否包含'used_time' -- 3. 业务含义:记录从未使用状态变更为已使用状态 -- 4. 为什么不用data.used_time is not null? -- 因为可能是其他字段的更新,需要确保是状态变更典型分析场景
-- 1. 每日优惠券核销统计 SELECT date_id, COUNT(DISTINCT coupon_id) as used_coupons, COUNT(DISTINCT user_id) as using_users, COUNT(DISTINCT order_id) as using_orders FROM dwd_tool_coupon_used_inc WHERE dt BETWEEN '2022-06-01' AND '2022-06-30' GROUP BY date_id ORDER BY date_id; -- 2. 优惠券核销时间分布 SELECT HOUR(payment_time) as hour_of_day, COUNT(*) as usage_count FROM dwd_tool_coupon_used_inc WHERE dt = '2022-06-09' GROUP BY HOUR(payment_time) ORDER BY hour_of_day;2. 互动域收藏商品事务事实表
业务场景深度解析
用户行为:主动表达对商品的兴趣
业务价值:
用户兴趣画像:收藏行为反映用户偏好
商品热度分析:被收藏次数多的商品更受欢迎
转化预测:收藏到购买的转化率分析
推荐系统:协同过滤算法的重要输入
数据装载技术细节
-- 首日装载:全量历史收藏记录 -- 为什么用'bootstrap-insert'? -- 这是数据初始化的一种约定,表示首次全量导入 -- 每日装载:只处理'insert'类型 -- 为什么不是'update'? -- 收藏行为一般是新增,取消收藏可能是'delete'或'update'状态 -- 这里只记录添加收藏的行为,取消收藏不记录(或记录在另一张事实表)数据分析应用
-- 1. 用户收藏行为分析 SELECT user_id, COUNT(*) as total_favors, COUNT(DISTINCT sku_id) as unique_sku_favors, MIN(create_time) as first_favor_time, MAX(create_time) as latest_favor_time FROM dwd_interaction_favor_add_inc WHERE dt = '2022-06-09' GROUP BY user_id HAVING COUNT(*) > 5; -- 找出重度收藏用户 -- 2. 商品收藏热度排行 SELECT sku_id, COUNT(*) as favor_count, COUNT(DISTINCT user_id) as favored_users, DATE_FORMAT(MIN(create_time), 'yyyy-MM-dd') as first_favor_date FROM dwd_interaction_favor_add_inc WHERE dt BETWEEN '2022-06-01' AND '2022-06-09' GROUP BY sku_id ORDER BY favor_count DESC LIMIT 100; -- 3. 收藏转化漏斗分析(需要关联订单表) WITH favor_stats AS ( SELECT f.user_id, f.sku_id, f.create_time as favor_time, o.order_time, CASE WHEN o.order_id IS NOT NULL THEN 1 ELSE 0 END as is_converted FROM dwd_interaction_favor_add_inc f LEFT JOIN dwd_order_info_inc o ON f.user_id = o.user_id AND f.sku_id = o.sku_id AND o.order_time > f.create_time AND o.order_time <= DATE_ADD(f.create_time, 7, 'DAY') -- 7天内转化 WHERE f.dt = '2022-06-01' ) SELECT COUNT(*) as total_favors, SUM(is_converted) as converted_favors, ROUND(SUM(is_converted) * 100.0 / COUNT(*), 2) as conversion_rate FROM favor_stats;3. 流量域页面浏览事务事实表
业务场景深度解析
页面类型分类:
首页、商品详情页、购物车页、订单确认页、支付页
活动页、品牌页、搜索结果页、个人中心页
用户路径分析:
页面流:A→B→C
跳出率:只访问一个页面就离开
转化路径:浏览→点击→加购→下单→支付
字段详细解释
-- 设备维度 mid_id: 'device_001' -- 设备唯一标识,用于设备级统计 brand: 'Apple' -- 设备品牌,分析用户设备偏好 model: 'iPhone 13' -- 设备型号 operate_system: 'iOS 15.4' -- 操作系统版本 version_code: '3.2.1' -- APP版本,用于分析版本迭代效果 -- 用户维度 user_id: 'user_1001' -- 登录用户ID(可能为空,未登录用户) province_id: '440000' -- 省份ID,基于IP解析 -- 页面维度 page_id: 'home' -- 页面标识 last_page_id: 'search' -- 上一个页面,分析流量来源 page_item: 'sku_1001' -- 页面内容(如商品ID) page_item_type: 'sku' -- 内容类型(sku、activity、brand等) -- 流量来源 refer_id: 'channel_001' -- 营销渠道ID(广告渠道、社交媒体等) from_pos_id: 'banner_01' -- 页面内的位置(如banner位) from_pos_seq: '1' -- 同一位置多个内容的排序 -- 行为度量 during_time: 5000 -- 页面停留时长(毫秒),计算页面吸引力 view_time: '2022-06-08 10:30:45' -- 进入页面时间 session_id: 'session_001' -- 会话ID,同一会话内多次页面浏览复杂技术点解析
-- 1. 时间转换:为什么要用from_utc_timestamp? -- 原始日志中的ts通常是UTC时间戳(如1654669845000) -- 业务分析需要本地时间(GMT+8北京时间) -- date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd') 作用: -- a. from_utc_timestamp(ts,'GMT+8'): 将UTC时间戳转为北京时间 -- b. date_format(...): 提取日期部分作为date_id -- 2. 为什么设置hive.cbo.enable=false? -- CBO(Cost Based Optimizer)是Hive的基于成本的优化器 -- 在复杂查询中,CBO可能生成非最优执行计划 -- 对于ETL处理,有时关闭CBO可以获得更稳定的性能 -- 3. 页面浏览的会话划分逻辑 -- 会话(session)是用户连续访问的页面序列 -- 通常按以下规则划分: -- a. 同一设备(mid_id)同一用户(user_id)的连续访问 -- b. 相邻页面访问间隔不超过30分钟 -- c. 会话ID在日志采集时生成流量分析实战
-- 1. 基础流量指标计算 WITH daily_traffic AS ( SELECT date_id, COUNT(*) as pv, -- 页面浏览量 COUNT(DISTINCT mid_id) as uv, -- 访客数(基于设备) COUNT(DISTINCT user_id) as login_uv, -- 登录用户数 COUNT(DISTINCT session_id) as sessions, -- 会话数 AVG(during_time) as avg_duration, -- 平均页面停留时长 SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) as login_pv -- 登录用户PV FROM dwd_traffic_page_view_inc WHERE dt = '2022-06-08' GROUP BY date_id ) SELECT * FROM daily_traffic; -- 2. 页面热力图分析 SELECT page_id, COUNT(*) as pv, COUNT(DISTINCT mid_id) as uv, COUNT(DISTINCT user_id) as login_users, AVG(during_time) as avg_duration, PERCENTILE(during_time, 0.5) as median_duration, -- 中位数更抗异常 SUM(CASE WHEN last_page_id IS NULL THEN 1 ELSE 0 END) as entry_count, -- 入口页面次数 SUM(CASE WHEN during_time < 3000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as bounce_rate -- 跳出率估算 FROM dwd_traffic_page_view_inc WHERE dt = '2022-06-08' GROUP BY page_id ORDER BY pv DESC; -- 3. 用户路径分析(页面流) WITH page_flow AS ( SELECT session_id, page_id, last_page_id, view_time, LAG(page_id) OVER (PARTITION BY session_id ORDER BY view_time) as prev_page, LEAD(page_id) OVER (PARTITION BY session_id ORDER BY view_time) as next_page, ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY view_time) as page_seq FROM dwd_traffic_page_view_inc WHERE dt = '2022-06-08' AND session_id IS NOT NULL ) SELECT COALESCE(last_page_id, 'entry') as from_page, page_id as to_page, COUNT(*) as transition_count, COUNT(DISTINCT session_id) as session_count FROM page_flow WHERE last_page_id IS NOT NULL OR page_seq = 1 GROUP BY COALESCE(last_page_id, 'entry'), page_id ORDER BY transition_count DESC LIMIT 20; -- 4. 转化漏斗分析(以商品详情页到下单为例) WITH user_journey AS ( SELECT user_id, session_id, MAX(CASE WHEN page_id = 'good_detail' THEN 1 ELSE 0 END) as viewed_detail, MAX(CASE WHEN page_id = 'cart' THEN 1 ELSE 0 END) as viewed_cart, MAX(CASE WHEN page_id = 'trade' THEN 1 ELSE 0 END) as viewed_trade FROM dwd_traffic_page_view_inc WHERE dt = '2022-06-08' AND user_id IS NOT NULL GROUP BY user_id, session_id ) SELECT COUNT(*) as total_sessions, SUM(viewed_detail) as detail_sessions, SUM(viewed_cart) as cart_sessions, SUM(viewed_trade) as trade_sessions, ROUND(SUM(viewed_detail) * 100.0 / COUNT(*), 2) as detail_rate, ROUND(SUM(viewed_cart) * 100.0 / SUM(viewed_detail), 2) as cart_conversion, ROUND(SUM(viewed_trade) * 100.0 / SUM(viewed_cart), 2) as trade_conversion FROM user_journey; -- 5. 渠道效果分析 SELECT channel, refer_id, COUNT(*) as pv, COUNT(DISTINCT mid_id) as uv, COUNT(DISTINCT user_id) as login_uv, SUM(CASE WHEN page_id = 'good_detail' THEN 1 ELSE 0 END) as detail_pv, SUM(CASE WHEN page_item_type = 'sku' THEN 1 ELSE 0 END) as sku_view_pv FROM dwd_traffic_page_view_inc WHERE dt = '2022-06-08' AND refer_id IS NOT NULL GROUP BY channel, refer_id ORDER BY pv DESC;4. 用户域用户注册事务事实表
业务场景深度解析
注册渠道分析:
自然流量:用户主动下载注册
广告渠道:通过广告点击下载注册
社交分享:通过分享链接注册
地推扫码:线下推广扫码注册
用户质量评估:
注册设备分布
注册地域分布
注册时间分布
表关联逻辑深度解析
-- 关键关联:用户信息表 LEFT JOIN 日志表 -- 为什么用LEFT JOIN? -- 1. 以用户信息为主表,确保所有注册用户都被记录 -- 2. 日志表中可能没有某些用户的注册页面记录(原因可能包括): -- a. 老用户注册时日志系统未上线 -- b. 注册时网络问题导致日志丢失 -- c. 通过API注册,不走页面流程 -- 关联条件详解: -- ui.user_id = log.user_id -- 日志筛选条件: -- page.page_id='register' -- 注册页面 -- common.uid is not null -- 用户ID不为空(已注册) -- dt='2022-06-08' -- 同一天(T+1处理,假设注册和日志同天) -- 潜在问题与解决方案: -- 1. 时间差问题:注册时间和日志时间可能不是同一天 -- 解决方案:扩大日志查询时间范围,如注册时间前后3天 -- 2. 多设备注册:一个用户可能用多个设备注册 -- 处理逻辑:取时间最近的设备信息 -- 3. 注册后立即登录:日志可能记录的是登录页面而非注册页面 -- 需要根据业务逻辑调整筛选条件注册分析实战
-- 1. 每日注册用户统计(含渠道分布) SELECT date_id, COUNT(*) as total_registrations, COUNT(CASE WHEN channel = 'appstore' THEN user_id END) as appstore_reg, COUNT(CASE WHEN channel = 'huawei_store' THEN user_id END) as huawei_reg, COUNT(CASE WHEN channel LIKE '%wechat%' THEN user_id END) as wechat_reg, COUNT(CASE WHEN channel IS NULL THEN user_id END) as unknown_channel, COUNT(DISTINCT province_id) as province_count FROM dwd_user_register_inc WHERE dt BETWEEN '2022-06-01' AND '2022-06-30' GROUP BY date_id ORDER BY date_id; -- 2. 注册用户设备分析 SELECT operate_system, brand, model, COUNT(*) as user_count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage FROM dwd_user_register_inc WHERE dt = '2022-06-09' GROUP BY operate_system, brand, model ORDER BY user_count DESC LIMIT 20; -- 3. 注册时间分布分析(小时级) SELECT HOUR(create_time) as register_hour, COUNT(*) as register_count, AVG(COUNT(*)) OVER () as avg_per_hour, COUNT(*) - AVG(COUNT(*)) OVER () as diff_from_avg FROM dwd_user_register_inc WHERE dt = '2022-06-09' GROUP BY HOUR(create_time) ORDER BY register_hour; -- 4. 注册用户地域分布 SELECT p.province_name, -- 需要关联省份维度表 COUNT(*) as register_count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage FROM dwd_user_register_inc r LEFT JOIN dim_province p ON r.province_id = p.province_id WHERE r.dt = '2022-06-09' GROUP BY p.province_name ORDER BY register_count DESC; -- 5. 注册后行为分析(关联其他事实表) WITH registered_users AS ( SELECT user_id, create_time as register_time, channel as register_channel FROM dwd_user_register_inc WHERE dt = '2022-06-01' ), user_activities AS ( SELECT r.user_id, r.register_time, r.register_channel, -- 注册后7天内是否有登录 MAX(CASE WHEN l.login_time >= r.register_time AND l.login_time <= DATE_ADD(r.register_time, 7, 'DAY') THEN 1 ELSE 0 END) as has_login_7d, -- 注册后7天内是否有下单 MAX(CASE WHEN o.order_time >= r.register_time AND o.order_time <= DATE_ADD(r.register_time, 7, 'DAY') THEN 1 ELSE 0 END) as has_order_7d, -- 注册后7天内下单金额 SUM(CASE WHEN o.order_time >= r.register_time AND o.order_time <= DATE_ADD(r.register_time, 7, 'DAY') THEN o.order_amount ELSE 0 END) as order_amount_7d FROM registered_users r LEFT JOIN dwd_user_login_inc l ON r.user_id = l.user_id AND l.dt BETWEEN '2022-06-01' AND '2022-06-08' LEFT JOIN dwd_order_info_inc o ON r.user_id = o.user_id AND o.dt BETWEEN '2022-06-01' AND '2022-06-08' GROUP BY r.user_id, r.register_time, r.register_channel ) SELECT register_channel, COUNT(*) as total_registrations, SUM(has_login_7d) as active_users_7d, SUM(has_order_7d) as paying_users_7d, ROUND(SUM(has_login_7d) * 100.0 / COUNT(*), 2) as login_rate_7d, ROUND(SUM(has_order_7d) * 100.0 / COUNT(*), 2) as conversion_rate_7d, AVG(order_amount_7d) as avg_order_amount_7d FROM user_activities GROUP BY register_channel ORDER BY conversion_rate_7d DESC;5. 用户域用户登录事务事实表
业务场景深度解析
登录方式:
账号密码登录
手机验证码登录
第三方授权登录(微信、QQ)
生物识别登录(指纹、人脸)
会话管理:
会话超时:通常30分钟无操作后需要重新登录
会话保持:APP端可能长期保持登录状态
多设备登录:一个账号同时在多个设备登录
技术实现深度解析
-- 核心逻辑:每个会话的第一次页面浏览视为登录 -- 为什么这样设计? -- 1. 实际登录可能没有独立的日志事件 -- 2. 登录后的第一个页面浏览可以准确标识登录时间 -- 3. 避免重复记录同一会话内的多次页面刷新 -- 窗口函数详解: -- row_number() over (partition by common.sid order by ts) rn -- partition by common.sid: 按会话分组 -- order by ts: 按时间戳排序 -- rn = 1: 取每个会话的第一条记录 -- 筛选条件解释: -- page is not null: 确保是页面浏览事件(不是其他事件如点击、曝光) -- common.uid is not null: 确保是已登录用户(未登录用户common.uid为空) -- dt = '2022-06-08': 处理当天的数据 -- 潜在问题: -- 1. 页面自动刷新:可能被误判为多次登录 -- 解决方案:结合APP的session管理,或设置最小时间间隔 -- 2. 未登录用户浏览:不会被记录 -- 这是符合业务定义的:登录事实表只记录登录事件 -- 3. 同一用户多设备登录:会被记录为多次登录 -- 这是符合业务实际的:每次设备登录都应记录登录行为分析实战
-- 1. 基础登录指标 SELECT date_id, COUNT(*) as login_times, -- 登录次数 COUNT(DISTINCT user_id) as dau, -- 日活跃用户 COUNT(DISTINCT mid_id) as devices, -- 活跃设备数 AVG(CASE WHEN channel = 'appstore' THEN 1 ELSE 0 END) as appstore_rate, AVG(CASE WHEN channel LIKE '%wechat%' THEN 1 ELSE 0 END) as wechat_rate FROM dwd_user_login_inc WHERE dt = '2022-06-08' GROUP BY date_id; -- 2. 用户登录频次分析 WITH user_login_stats AS ( SELECT user_id, COUNT(*) as login_days, MIN(date_id) as first_login_date, MAX(date_id) as last_login_date, COUNT(DISTINCT brand) as device_brands, COUNT(DISTINCT channel) as login_channels FROM dwd_user_login_inc WHERE dt BETWEEN '2022-06-01' AND '2022-06-30' GROUP BY user_id ) SELECT login_days, COUNT(*) as user_count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage, AVG(device_brands) as avg_brands, AVG(login_channels) as avg_channels FROM user_login_stats GROUP BY login_days ORDER BY login_days; -- 3. 用户登录时段分析 SELECT HOUR(login_time) as login_hour, COUNT(*) as login_count, COUNT(DISTINCT user_id) as unique_users, COUNT(DISTINCT mid_id) as unique_devices, ROUND(AVG(CASE WHEN channel = 'appstore' THEN 1 ELSE 0 END) * 100, 2) as appstore_percent FROM dwd_user_login_inc WHERE dt = '2022-06-09' GROUP BY HOUR(login_time) ORDER BY login_hour; -- 4. 用户登录设备偏好分析 SELECT user_id, -- 用户最常用的设备品牌 FIRST_VALUE(brand) OVER ( PARTITION BY user_id ORDER BY COUNT(*) DESC, MAX(login_time) DESC ) as preferred_brand, -- 用户最常用的设备型号 FIRST_VALUE(model) OVER ( PARTITION BY user_id ORDER BY COUNT(*) DESC, MAX(login_time) DESC ) as preferred_model, -- 用户使用的设备数 COUNT(DISTINCT mid_id) as device_count, -- 用户登录总次数 COUNT(*) as total_logins FROM dwd_user_login_inc WHERE dt BETWEEN '2022-06-01' AND '2022-06-30' GROUP BY user_id, brand, model QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY COUNT(*) DESC) = 1; -- 5. 登录到行为转化分析 WITH login_sessions AS ( SELECT l.user_id, l.login_time, l.channel, l.mid_id, -- 登录后是否有浏览 MAX(CASE WHEN p.view_time > l.login_time AND p.view_time < DATE_ADD(l.login_time, 1, 'HOUR') THEN 1 ELSE 0 END) as has_browse, -- 登录后是否有下单 MAX(CASE WHEN o.order_time > l.login_time AND o.order_time < DATE_ADD(l.login_time, 24, 'HOUR') THEN 1 ELSE 0 END) as has_order FROM dwd_user_login_inc l LEFT JOIN dwd_traffic_page_view_inc p ON l.user_id = p.user_id AND l.dt = p.dt LEFT JOIN dwd_order_info_inc o ON l.user_id = o.user_id AND o.dt = l.dt WHERE l.dt = '2022-06-09' GROUP BY l.user_id, l.login_time, l.channel, l.mid_id ) SELECT channel, COUNT(*) as total_logins, SUM(has_browse) as browse_after_login, SUM(has_order) as order_after_login, ROUND(SUM(has_browse) * 100.0 / COUNT(*), 2) as browse_rate, ROUND(SUM(has_order) * 100.0 / COUNT(*), 2) as order_rate FROM login_sessions GROUP BY channel ORDER BY total_logins DESC;三、数据治理与质量控制
1. 数据一致性检查
-- 检查各事实表的数据完整性 SELECT 'dwd_tool_coupon_used_inc' as table_name, COUNT(*) as total_rows, COUNT(DISTINCT dt) as partition_count, MIN(dt) as min_date, MAX(dt) as max_date, SUM(CASE WHEN payment_time IS NULL THEN 1 ELSE 0 END) as null_payment_time FROM dwd_tool_coupon_used_inc WHERE dt >= '2022-06-01' UNION ALL SELECT 'dwd_interaction_favor_add_inc', COUNT(*), COUNT(DISTINCT dt), MIN(dt), MAX(dt), SUM(CASE WHEN create_time IS NULL THEN 1 ELSE 0 END) FROM dwd_interaction_favor_add_inc WHERE dt >= '2022-06-01';2. 数据血缘分析
ods_coupon_use_inc → dwd_tool_coupon_used_inc ods_favor_info_inc → dwd_interaction_favor_add_inc ods_log_inc → dwd_traffic_page_view_inc ods_user_info_inc + ods_log_inc → dwd_user_register_inc ods_log_inc → dwd_user_login_inc3. 性能优化建议
-- 1. 分区优化 -- 按月建立二级分区 PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING) -- 2. 索引优化 -- 对常用查询字段建立索引 CREATE INDEX idx_user_id ON dwd_user_login_inc(user_id); CREATE INDEX idx_date_id ON dwd_user_login_inc(date_id); -- 3. 数据压缩 TBLPROPERTIES ("orc.compress" = "snappy") -- 已配置 -- 4. 小文件合并 -- 设置合并参数 SET hive.merge.mapfiles=true; SET hive.merge.mapredfiles=true; SET hive.merge.size.per.task=256000000; SET hive.merge.smallfiles.avgsize=128000000;四、业务应用场景总结
1.运营分析
优惠券使用率监控
用户增长趋势分析
渠道效果评估
用户活跃度监控
2.用户分析
用户行为路径分析
用户兴趣偏好挖掘
用户生命周期管理
用户分群与标签
3.产品分析
页面功能使用分析
用户交互行为分析
产品迭代效果评估
A/B测试效果验证
4.商业分析
营销活动ROI计算
用户价值评估
预测模型构建
业务健康度监控
五、常见问题与解决方案
1.数据延迟问题
-- 解决方案:建立数据延迟监控 SELECT table_name, MAX(dt) as latest_partition, DATEDIFF(CURRENT_DATE, MAX(dt)) as delay_days FROM ( SELECT 'dwd_tool_coupon_used_inc' as table_name, MAX(dt) as dt FROM dwd_tool_coupon_used_inc UNION ALL SELECT 'dwd_user_login_inc', MAX(dt) FROM dwd_user_login_inc ) t GROUP BY table_name;2.数据质量监控
-- 监控关键字段的null值率 SELECT dt, table_name, field_name, total_count, null_count, ROUND(null_count * 100.0 / total_count, 2) as null_rate FROM ( SELECT dt, 'dwd_user_register_inc' as table_name, 'channel' as field_name, COUNT(*) as total_count, SUM(CASE WHEN channel IS NULL THEN 1 ELSE 0 END) as null_count FROM dwd_user_register_inc WHERE dt = '2022-06-09' GROUP BY dt ) t WHERE null_rate > 5; -- 设置阈值3.数据回溯处理
-- 当需要重新计算历史数据时 -- 1. 清除历史分区 ALTER TABLE dwd_tool_coupon_used_inc DROP IF EXISTS PARTITION (dt='2022-06-08'); -- 2. 重新装载数据 INSERT OVERWRITE TABLE dwd_tool_coupon_used_inc PARTITION(dt='2022-06-08') SELECT ... FROM ods_coupon_use_inc WHERE dt='2022-06-08' AND ...;六、完整代码
-- 六、工具域优惠券使用(支付)事务事实表 -- 建表语句 DROP TABLE IF EXISTS dwd_tool_coupon_used_inc; CREATE EXTERNAL TABLE dwd_tool_coupon_used_inc ( `id` STRING COMMENT '编号', `coupon_id` STRING COMMENT '优惠券ID', `user_id` STRING COMMENT '用户ID', `order_id` STRING COMMENT '订单ID', `date_id` STRING COMMENT '日期ID', `payment_time` STRING COMMENT '使用(支付)时间' ) COMMENT '优惠券使用(支付)事务事实表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dwd/dwd_tool_coupon_used_inc/' TBLPROPERTIES ("orc.compress" = "snappy"); -- 工具域优惠券使用(支付)事务事实表首日数据装载 set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table dwd_tool_coupon_used_inc partition(dt) select data.id, data.coupon_id, data.user_id, data.order_id, date_format(data.used_time,'yyyy-MM-dd') date_id, data.used_time, date_format(data.used_time,'yyyy-MM-dd') from ods_coupon_use_inc where dt='2022-06-08' and type='bootstrap-insert' and data.used_time is not null; -- 工具域优惠券使用(支付)事务事实表每日数据装载 insert overwrite table dwd_tool_coupon_used_inc partition(dt='2022-06-09') select data.id, data.coupon_id, data.user_id, data.order_id, date_format(data.used_time,'yyyy-MM-dd') date_id, data.used_time from ods_coupon_use_inc where dt='2022-06-09' and type='update' and array_contains(map_keys(old),'used_time'); -- 七、互动域收藏商品事务事实表 -- 建表语句 DROP TABLE IF EXISTS dwd_interaction_favor_add_inc; CREATE EXTERNAL TABLE dwd_interaction_favor_add_inc ( `id` STRING COMMENT '编号', `user_id` STRING COMMENT '用户ID', `sku_id` STRING COMMENT 'SKU_ID', `date_id` STRING COMMENT '日期ID', `create_time` STRING COMMENT '收藏时间' ) COMMENT '互动域收藏商品事务事实表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dwd/dwd_interaction_favor_add_inc/' TBLPROPERTIES ("orc.compress" = "snappy"); -- 互动域收藏商品事务事实表首日数据装载 set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table dwd_interaction_favor_add_inc partition(dt) select data.id, data.user_id, data.sku_id, date_format(data.create_time,'yyyy-MM-dd') date_id, data.create_time, date_format(data.create_time,'yyyy-MM-dd') from ods_favor_info_inc where dt='2022-06-08' and type = 'bootstrap-insert'; -- 互动域收藏商品事务事实表每日数据装载 insert overwrite table dwd_interaction_favor_add_inc partition(dt='2022-06-09') select data.id, data.user_id, data.sku_id, date_format(data.create_time,'yyyy-MM-dd') date_id, data.create_time from ods_favor_info_inc where dt='2022-06-09' and type = 'insert'; -- 八、流量域页面浏览事务事实表 -- 建表语句 DROP TABLE IF EXISTS dwd_traffic_page_view_inc; CREATE EXTERNAL TABLE dwd_traffic_page_view_inc ( `province_id` STRING COMMENT '省份ID', `brand` STRING COMMENT '手机品牌', `channel` STRING COMMENT '渠道', `is_new` STRING COMMENT '是否首次启动', `model` STRING COMMENT '手机型号', `mid_id` STRING COMMENT '设备ID', `operate_system` STRING COMMENT '操作系统', `user_id` STRING COMMENT '会员ID', `version_code` STRING COMMENT 'APP版本号', `page_item` STRING COMMENT '目标ID', `page_item_type` STRING COMMENT '目标类型', `last_page_id` STRING COMMENT '上页ID', `page_id` STRING COMMENT '页面ID ', `from_pos_id` STRING COMMENT '点击坑位ID', `from_pos_seq` STRING COMMENT '点击坑位位置', `refer_id` STRING COMMENT '营销渠道ID', `date_id` STRING COMMENT '日期ID', `view_time` STRING COMMENT '跳入时间', `session_id` STRING COMMENT '所属会话ID', `during_time` BIGINT COMMENT '持续时间毫秒' ) COMMENT '流量域页面浏览事务事实表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dwd/dwd_traffic_page_view_inc' TBLPROPERTIES ('orc.compress' = 'snappy'); -- 流量域页面浏览事务事实表数据装载 set hive.cbo.enable=false; insert overwrite table dwd_traffic_page_view_inc partition (dt='2022-06-08') select common.ar province_id, common.ba brand, common.ch channel, common.is_new is_new, common.md model, common.mid mid_id, common.os operate_system, common.uid user_id, common.vc version_code, page.item page_item, page.item_type page_item_type, page.last_page_id, page.page_id, page.from_pos_id, page.from_pos_seq, page.refer_id, date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd') date_id, date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd HH:mm:ss') view_time, common.sid session_id, page.during_time from ods_log_inc where dt='2022-06-08' and page is not null; set hive.cbo.enable=true; -- 九用户域用户注册事务事实表 -- 建表语句 DROP TABLE IF EXISTS dwd_user_register_inc; CREATE EXTERNAL TABLE dwd_user_register_inc ( `user_id` STRING COMMENT '用户ID', `date_id` STRING COMMENT '日期ID', `create_time` STRING COMMENT '注册时间', `channel` STRING COMMENT '应用下载渠道', `province_id` STRING COMMENT '省份ID', `version_code` STRING COMMENT '应用版本', `mid_id` STRING COMMENT '设备ID', `brand` STRING COMMENT '设备品牌', `model` STRING COMMENT '设备型号', `operate_system` STRING COMMENT '设备操作系统' ) COMMENT '用户域用户注册事务事实表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dwd/dwd_user_register_inc/' TBLPROPERTIES ("orc.compress" = "snappy"); -- 用户域用户注册事务事实表首日数据装载 set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table dwd_user_register_inc partition(dt) select ui.user_id, date_format(create_time,'yyyy-MM-dd') date_id, create_time, channel, province_id, version_code, mid_id, brand, model, operate_system, date_format(create_time,'yyyy-MM-dd') from ( select data.id user_id, data.create_time from ods_user_info_inc where dt='2022-06-08' and type='bootstrap-insert' )ui left join ( select common.ar province_id, common.ba brand, common.ch channel, common.md model, common.mid mid_id, common.os operate_system, common.uid user_id, common.vc version_code from ods_log_inc where dt='2022-06-08' and page.page_id='register' and common.uid is not null )log on ui.user_id=log.user_id; -- 用户域用户注册事务事实表每日数据装载 insert overwrite table dwd_user_register_inc partition(dt='2022-06-09') select ui.user_id, date_format(create_time,'yyyy-MM-dd') date_id, create_time, channel, province_id, version_code, mid_id, brand, model, operate_system from ( select data.id user_id, data.create_time from ods_user_info_inc where dt='2022-06-09' and type='insert' )ui left join ( select common.ar province_id, common.ba brand, common.ch channel, common.md model, common.mid mid_id, common.os operate_system, common.uid user_id, common.vc version_code from ods_log_inc where dt='2022-06-09' and page.page_id='register' and common.uid is not null )log on ui.user_id=log.user_id; -- 十\用户域用户登录事务事实表 -- 建表语句 DROP TABLE IF EXISTS dwd_user_login_inc; CREATE EXTERNAL TABLE dwd_user_login_inc ( `user_id` STRING COMMENT '用户ID', `date_id` STRING COMMENT '日期ID', `login_time` STRING COMMENT '登录时间', `channel` STRING COMMENT '应用下载渠道', `province_id` STRING COMMENT '省份ID', `version_code` STRING COMMENT '应用版本', `mid_id` STRING COMMENT '设备ID', `brand` STRING COMMENT '设备品牌', `model` STRING COMMENT '设备型号', `operate_system` STRING COMMENT '设备操作系统' ) COMMENT '用户域用户登录事务事实表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dwd/dwd_user_login_inc/' TBLPROPERTIES ("orc.compress" = "snappy"); -- 用户域用户登录事务事实表数据装载 insert overwrite table dwd_user_login_inc partition (dt = '2022-06-08') select user_id, date_format(from_utc_timestamp(ts, 'GMT+8'), 'yyyy-MM-dd') date_id, date_format(from_utc_timestamp(ts, 'GMT+8'), 'yyyy-MM-dd HH:mm:ss') login_time, channel, province_id, version_code, mid_id, brand, model, operate_system from ( select user_id, channel, province_id, version_code, mid_id, brand, model, operate_system, ts from (select common.uid user_id, common.ch channel, common.ar province_id, common.vc version_code, common.mid mid_id, common.ba brand, common.md model, common.os operate_system, ts, row_number() over (partition by common.sid order by ts) rn from ods_log_inc where dt = '2022-06-08' and page is not null and common.uid is not null) t1 where rn = 1 ) t2;通过以上详细分析,可以看出DWD层事实表设计充分考虑了业务需求和技术实现,为上层的数据分析和应用提供了稳定、准确、高效的明细数据基础。